T. Bressoud, D. White, Introduction to Data Systems, https://doi.org/10.1007/978-3-030-54371-6_21

21. Interlude: Client Data Acquisition

Thomas Bressoud¹ and David White¹

(1) Mathematics and Computer Science, Denison University, Granville, OH, USA

Chapter Goals

Upon completion of this chapter, you should understand the following:

The difference between text strings and byte strings, and between text files and binary files.
The concept of encoding and decoding to get back and forth between text and byte representation.
The importance of overlaying a file-like view on a set of data acquired over a network.

Upon completion of this chapter, you should be able to do the following:

Access an HTTP result, both as raw bytes and as a character string.
Change the encoding used to interpret the raw bytes as a string.
Overlay a file-like interface on binary (bytes) data and on string-based result data to enable better access to the data.
Combine the above to transform the data into the structures for the different formats of data (CSV, JSON, XML) explored in Part II of the book.

In Chap. 20, we learned the syntax of HTTP, and how to use the requests module to issue GET and POST request messages that allow us to obtain results from a web server. In this chapter, we explore variations of taking the results and transforming the data into an in-memory structure usable in our client programs.

In Chaps. 6 and 15 and in Sect. 2.4, we used local files as our data source to get CSV-, XML-, and JSON-formatted data into pandas, into an element tree, and into a dictionary/list composite structure, respectively. Now data in these same formats is being acquired over the network and arrives at our client application in the body of an HTTP response. We need to process and obtain the same in-memory structure of a pandas data frame from CSV-formatted data, an lxml element tree from XML-formatted data, or the Python data structure from JSON-formatted data. To do this correctly, we will often require understanding of the encoding, by which the characters of the data at the server are mapped to the bytes transmitted over the network.

21.1 Encoding and Decoding

Recall from Sect. 2.2.2 that the term encoding (aka codec) defines a translation from a sequence of characters (i.e., a string or text) to the sequence of bytes that are used to represent that character sequence. Given an encoding, then, for each character, there is a specific translation of that character into its byte representation. Some encodings are limited in the set of characters that they are able to encode, and thus the alphabets and languages they can support. Other encodings allow the full Unicode character set as their input and can adapt to many alphabets and languages. In the reverse direction, given a sequence of bytes representing a character sequence, along with knowledge of the specific encoding used, we use the term decoding for the process of converting the bytes back into the original character sequence.

In Table 21.1, we list some of the most common encodings. Python documentation provides the set of encodings supported [50].

Table 21.1

Common encodings

Encoding	Bytes per Char	Notes
ASCII	Exactly 1	One of the most limited encodings, only supporting the English alphabet, and including A-Z, a-z, 0-9, and basic keyboard special characters
UTF-8	1 to 4, but using 1 whenever possible	Supports full Unicode but is backwardly compatible to ASCII for the one byte characters supported there. This encoding accounts for 95% of the web
UTF-16	2 to 4, with 2 the most dominant	Supports full Unicode and is used more on Windows platforms, where it started as always 2 bytes, until it needed to be expanded. Variations include UTF-16BE and UTF-16LE that make explicit the ordering of multibyte units
ISO-8859-1	Exactly 1	Latin character set, starting from ASCII, but adding common European characters and diacritics. Also known as LATIN_1

Suppose we have the character sequence "All the world's a stage …" and are using the specific encoding of UTF-16BE for the mapping of characters to bytes. Figure 21.1 illustrates the encoding and decoding process. Given the original string and the encoding of UTF-16BE, the encode operation translates the string into a byte sequence. In the figure, we represent the byte sequence using hexadecimal digits (0-9 plus a through f), where each pair of hex digits is equivalent to exactly one byte. So the A character maps to two bytes, written in hex as 0041; lower case l maps to hex 006c, and we see these bytes repeated, and so forth. When we are given a collection of bytes, like in the middle of the figure, then the decode operation, with knowledge of the correct encoding of UTF-16BE, can translate the bytes back into the original character sequence.

Fig. 21.1 Encoding and decoding process

Relative to this book, there are three contexts within which we should be aware of encoding. First, within our Python programs, we may have character strings from which we need to explicitly generate a raw bytes representation, or vice versa. Second, every local file is actually stored as a sequence of bytes. If it is a text file, we need to ensure that, whether reading from a text file or writing to a text file, we are able to specify a desired encoding as appropriate. Third, when we are acquiring data over the network, the body of an HTTP response message is also conveyed over the TCP reliable byte-stream as a sequence of raw bytes. So we want to be able to decode those bytes into their original character sequence when the contents of the message are, in fact, text.

We address the first context, encoding and decoding explicitly in our Python program, here in this section. Encoding and decoding when interacting with files have already been covered in Sect. 2.2.2 and will not be repeated here, but we will give file-based examples in Sects. 21.2 through 21.4. The third context is directly related to the main goals of this chapter and will also be illustrated in Sects. 21.2 to 21.4.

21.1.1 Python Strings and Bytes

The characters of Python strings allow for the full Unicode character set and thus can support a spectrum of alphabets and languages. Further, Python, by default, uses the UTF-8 encoding. These defaults of character set and encoding are often sufficient when our programs are not interacting with text data from outside sources. In this case, we rarely have to do explicit translations, or to specify encodings as we open and use files. But when text data originates from some outside source, and if that outside source might use a different encoding, we must have the tools to interpret the data.

In Python, we represent sequences of characters as a string data type, and the type name is str. Individual characters do not have a separate type and are represented as str whose length is one. Python also has a class bytes that is used to represent a set of raw bytes. This data type can be used both for the result of an encode( ) operation on a string and for non-string types of binary data. A Python value of the bytes data type can be created using a constant syntax similar to that of strings, but prefixed with a b character. For instance, b'Hello!' defines a bytes value that is the UTF-8 encoding of the string that follows the b character. But this value should not be mistakenly thought of as a str value.

21.1.1.1 The Encode Operation: A String to Bytes

Our first two examples start from a Python string, which could be any valid Unicode string in a Python program, and use the encode( ) method of the string type to perform the encoding operation, yielding a bytes result. We print the type of the original string and the type of the encoded value. The bytes class has a method, hex( ), that displays the hex byte sequence for its data, and we use this to help convey the raw data bytes from the encoded value referenced by b16.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figa_HTML.png

| <class 'str'> <class 'bytes'> 00480065006c006c006f0021

The next example is functionally the same but instead uses the UTF-8 encoding.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figb_HTML.png

| <class 'str'> <class 'bytes'> 48656c6c6f21
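The two encode examples above might be written along the following lines. This is a minimal sketch: the variable name s is an assumption, while b16 and b8 follow the names used for these values later in this section.

```python
# Encode the same string with two different codecs and compare the bytes.
s = "Hello!"

b16 = s.encode("UTF-16BE")   # two bytes per character for this string
b8 = s.encode("UTF-8")       # one byte per character for this string

print(type(s), type(b16))
print(b16.hex())             # hex digits, two per byte
print(b8.hex())
```

Both results are bytes values of different lengths, even though they represent the same six characters.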

If we are dealing with a single character, Python provides a built-in function, ord( ), for obtaining the integer Unicode code point of that character. For characters in the ASCII range, such as 'H', this integer coincides with the single byte of the character's UTF-8 encoding; for other characters, such as the Euro symbol, the code point (0x20ac) differs from the multi-byte UTF-8 encoding (e2 82 ac in hex). In the example below, we obtain the code point of the 'H' character as well as that of the Unicode Euro symbol. As shown through the output, the data type of the result is an int. We print the integer value and also use the built-in function hex( ) to show the results in the more familiar hexadecimal representation.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figc_HTML.png

| <class 'int'> 72 0x48 <class 'int'> 8364 0x20ac
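A sketch of the ord( ) example follows, assuming the names b1 and b2 for the two integers (the names used for them later in this section). The last two lines are an addition for comparison: they show that the integer from ord( ) is the Unicode code point, which for the Euro symbol is not the same as its three-byte UTF-8 encoding.

```python
b1 = ord("H")
b2 = ord("\u20ac")   # "\u20ac" is the Euro symbol

print(type(b1), b1, hex(b1))
print(type(b2), b2, hex(b2))

# For comparison (not part of the original example):
# the UTF-8 bytes of the Euro symbol.
euro_utf8 = "\u20ac".encode("UTF-8")
print(euro_utf8.hex())
```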

We can also encode a string s into byte form using b = bytes( s, encoding = 'ascii') , which returns a byte version of the string (e.g., if s = "Ben" then b = b"Ben").

The important takeaway through all these examples is that encoding is operating on a string or a character (type str), and the result is one or more bytes (type bytes).

21.1.1.2 The Decode Operation: Bytes to a String

The decode operation translates a byte sequence back into its original sequence of characters, based on the encoding. As long as the encoding used in the decode operation is the same one used by a prior encode operation, the resulting string will be the same one we started with. In similar fashion to the examples above, we can see decode for a multiple character sequence or for a single character.

From our examples above, b16 is a bytes value resulting from a UTF-16BE encode operation, and b8 is a bytes value resulting from a UTF-8 encode operation. The bytes class has a decode( ) method, whose argument is the encoding/codec used in the prior encode operation. If no argument is given, the default encoding is used.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figd_HTML.png

| s16: Hello! s8: Hello!

We see that although b16 and b8 were clearly not the same bytes value, after decoding, s16 and s8 have the same original character sequence.
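The decode step might look like the sketch below, which recreates b16 and b8 so that it runs on its own. The final lines, which decode with a mismatched codec, are an added illustration and not part of the original example.

```python
b16 = "Hello!".encode("UTF-16BE")
b8 = "Hello!".encode("UTF-8")

s16 = b16.decode("UTF-16BE")   # codec must match the one used to encode
s8 = b8.decode()               # no argument: the default, UTF-8, is used
print("s16:", s16, "s8:", s8)

# Added illustration: decoding with the wrong codec does not always raise
# an error; here it silently yields a different string.
mismatch = b16.decode("UTF-8")
print(repr(mismatch))
```

A mismatched decode that happens to succeed is the harder case to notice, which is why knowing the true encoding matters.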

The Python built-in function chr( ) performs the single-character reverse of the ord( ) function. Its argument is an int, and the result is the one-character string whose Unicode code point is that value. We demonstrate using b1 and b2, integers obtained using ord( ) in the above example.

../images/479588_1_En_21_Chapter/479588_1_En_21_Fige_HTML.png

| s1: H s2: €

We can also use the conversion capability of the str( ) function, with a first argument that is the encoded bytes value, and can specify an encoding= named parameter to control the conversion to use the specified encoding. So str( b16, encoding='UTF-16BE') yields the same result as b16.decode( 'UTF-16BE') .
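As a minimal sketch of these two conversions, the following recreates the integers and the bytes value from the earlier examples. The name via_str is an assumption introduced for illustration.

```python
b1, b2 = 72, 8364              # values returned by ord( ) for 'H' and the Euro symbol

s1 = chr(b1)
s2 = chr(b2)
print("s1:", s1, "s2:", s2)

b16 = "Hello!".encode("UTF-16BE")
via_str = str(b16, encoding="UTF-16BE")    # same result as b16.decode("UTF-16BE")
print(via_str)
```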

The important takeaway through these examples is that decoding is operating on a bytes or int value, and the result is a string (type str).

21.1.2 Prelude to Format Examples

Sections 21.2 through 21.4 will focus on each of the primary formats of CSV, JSON, and XML, with the goal of demonstrating translations of both local files and HTTP response messages into structures usable in our client applications.

In local files, we often know the encoding, but when data is retrieved through the body of an HTTP response message, we cannot assume that the encoding will be ASCII, or UTF-8, or ISO-8859-1. It is important to remember that the bytes of the data files are reaching us via a byte-stream (TCP), which does not mandate textual data, nor require a particular encoding. Furthermore, many different encodings are possible, such as UTF-16BE, the “big endian” form of the 16-bit UTF-16 encoding, which uses two raw bytes for most characters. We will discuss this further in the sections below.

For illustration, our examples use the files listed in Table 21.2, both locally and as retrieved over the network.

It will be important to keep in mind the distinction between strings and byte strings, and between text files and binary files, in the discussion to come. We now discuss a sequence of vignettes, with how to acquire data in the formats CSV, JSON, and XML.

The network examples in Sects. 21.2 to 21.4 will all make requests from the book web page, https://datasystems.denison.edu, with resource paths specifying the files in Table 21.2.

Table 21.2

Example files in various formats and multiple encodings

UTF-8 Encoded	UTF-16BE Encoded	Description
ind2016.csv	ind2016_16.csv	Six country indicators from year 2016 in CSV format
ind0.json	ind0_16.json	Indicator data of pop and gdp for three countries for two different years in JSON format
ind0.xml	ind0_16.xml	Indicator data of pop and gdp for three countries for two different years in XML format

As we work with the body of HTTP response messages, it will be important to keep in mind the distinction between strings and binary data as a bytes data type. The requests module gives us two ways to extract data from the body of a response, using the attributes of a Response object:

Response.content: the raw bytes version of the data,
Response.text: the decoded translation of the raw bytes into a sequence of characters.

The latter uses an assumed/inferred encoding, which can be found through the attribute Response.encoding. We will give examples of extracting this information so that it can be used by our code. We will also show how to read both types of data (i.e., either the text data from Response.text or the underlying byte data from Response.content).

In common amongst the set of network examples, we use our custom util.buildURL( ) function to construct a string url prior to each requests invocation. This, along with helper functions to print results, is documented in Appendix A, in Sect. A.1. For buildURL( ) , we specify the desired resource-path in the first argument and the host location in the second argument.
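The actual util.buildURL( ) helper is documented in Appendix A and is not reproduced here. Purely as an illustration of the argument order described above, a hypothetical stand-in could look like the following; the function name build_url_sketch, the protocol parameter, and the resource path /data/ind2016.csv are all assumptions.

```python
def build_url_sketch(resource_path, host, protocol="https"):
    """Hypothetical stand-in for util.buildURL( ); not the book's implementation."""
    # Make sure the resource path begins with a slash before joining.
    if not resource_path.startswith("/"):
        resource_path = "/" + resource_path
    return protocol + "://" + host + resource_path

# Resource path first, host second, as in the description above.
csv_url = build_url_sketch("/data/ind2016.csv", "datasystems.denison.edu")
print(csv_url)
```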

21.1.3 Reading Questions

21.1

How many characters can be encoded in ASCII, and what are some examples of “basic keyboard special characters”? Use a web search to answer this if you have never seen ASCII before.

21.2

What does it mean that UTF-8 is “backwardly compatible” to ASCII?

21.3

How many characters can be encoded with ISO-8859-1 and is it backwardly compatible with ASCII? You are encouraged to use a web search if this is your first exposure to ISO-8859-1.

21.4

The reading points out that “every local file is actually stored as a sequence of bytes.” Have you ever had the experience of trying to open a local file in a program (e.g., a text editor), and seeing something very strange display? This probably had to do with your computer using the wrong decoding scheme. Describe your experience when this happened.

21.5

Is the body of a response to an HTTP GET always text? Justify your answer or give a counterexample.

21.6

What do you notice about the hex byte sequence representations for the two “Hello!” encode( ) examples? Why is this?

21.7

Conceptually, why do you think there are so many different encoding schemes, and why is it important to keep them compatible?

21.8

As the files ind2016.csv and ind2016_16.csv are stored on the book web page, please go and download them, then try to open them in the most naive program possible (i.e., a simple text editor). Describe what you see.

21.9

Please refer back to an HTTP GET request you made in the previous chapter using the requests module and use the .content and .text attributes to look at the data. Investigate these two quantities using print( ) and type( ) and report what you find.

21.10

Please refer back to an HTTP GET request you made in the previous chapter using the requests module and determine the encoding using the .encoding attribute of the response. What did you find?

21.1.4 Exercises

21.11

One of the reasons for the existence of Unicode is its ability to use strings that go beyond the limitations of the keyboard. Relative to the discussion in the chapter, Unicode is about the strings we can use in our programs, and the issue of how they translate/map to a sequence of bytes (i.e., their encoding) is a separate concept.

When we have the code point (generally a hex digit sequence identifying an index into the set of characters) for a Unicode character that is beyond our normal keyboard characters, we can include them in our strings by using the \u escape prefix followed by the hex digits for the code point. Consider the Python string s:

s = "Unicode examples: \u2B2C and \u266A and \u1F60 and " \
    "\u265E and \u0394 and \u0402"

Write code to print s, then assign to b8 the UTF-8 encoding of s, and b16 the UTF-16BE encoding of s. For each, use the hex( ) method of the bytes data type to see a hex version of the encoded values. Answer the following questions:

which of the hex representations is longer?
give explicit lengths for b8, b16, and for the two hex( ) transformations.
how does this compare to the length of s?

21.12

Write a function

shiftLetter(letter, n)

whose parameter, letter, should be a single character. If the character is between "A" and "Z", the function returns an uppercase character n positions further along, and “wrapping” if the + n mapping goes past "Z". Likewise, it should map the lower case characters between "a" and "z". If the parameter letter is anything else, or not of length 1, the function should return letter.

Hint: review functions ord( ) and chr( ) from the section, as well as the modulus operator % .

21.13

Building on the previous exercise, write a function

encrypt(plaintext, n)

that performs a shiftLetter for each of the letters in plaintext and accumulates and returns the resultant string.

21.14

Write a function

singleByteChars(s)

that takes its argument, s, and determines whether or not all the characters in s can be encoded by a single byte. The function should return the Boolean True if so, and False otherwise.

21.15

Suppose you have, in your Python program, a variable that refers to a bytes data type, like mystery refers to the bytes constant literal as given here:

mystery = b'\xc9\xa2\x95}\xa3@\x89\xa3@\x87\x99\x85'\
          b'\x81\xa3@\xa3\x96@\x82\x85@\xa2\x96\x93'\
          b'\xa5\x89\x95\x87@\x97\x99\x96\x82\x93\x85'\
          b'\x94\xa2o@@\xe8\x96\xa4@\x82\x85\xa3Z'

Perhaps this value came from a network message, or from a file. But you suspect that it, in fact, holds the bytes for a character string, and you need to figure out how it is encoded. Assume that you have narrowed the encodings down to one of the following:

“UTF-8,”
“UTF-16BE,”
“cp037,”
“latin_1.”

Write code to convert the byte sequence to a character string and determine the correct encoding.

21.2 CSV Data

We have seen that the CSV format is ubiquitous in the world of tidy data. A working data scientist will often work with both local CSV files and CSV files obtained over a network.

21.2.1 CSV from File Data

Suppose we are given a local csv file, ind2016.csv. We have previously seen how to read this file into a pandas data frame:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figf_HTML.png

|    code        country      pop       gdp   life     cell
| 0   CAN         Canada    36.26   1535.77  82.30    30.75
| 1   CHN          China  1378.66  11199.15  76.25  1364.93
| 2   IND          India  1324.17   2263.79  68.56  1127.81
| 3   RUS         Russia   144.34   1283.16  71.59   229.13
| 4   USA  United States   323.13  18624.47  78.69   395.88

Now consider the file ind2016_16.csv. In this file, the same sequence of characters is encoded as UTF-16BE. This is still a text file. It just has a different mapping from the characters to the bytes of the file. If we attempt to read it into a pandas data frame, the operation seems to complete. However, when we look at the head( ) of the data, we see that it has been mangled by the read_csv( ) command:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figg_HTML.png

|    Unnamed: 0  Unnamed: 1  Unnamed: 2  Unnamed: 3
| 0         NaN         NaN         NaN         NaN
| 1         NaN         NaN         NaN         NaN

Fortunately, the pandas read_csv has a named parameter, encoding=, that we can use to specify the true encoding of the file and thus to get the correct results:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figh_HTML.png

|    code        country      pop       gdp   life     cell
| 0   CAN         Canada    36.26   1535.77  82.30    30.75
| 1   CHN          China  1378.66  11199.15  76.25  1364.93
| 2   IND          India  1324.17   2263.79  68.56  1127.81
| 3   RUS         Russia   144.34   1283.16  71.59   229.13
| 4   USA  United States   323.13  18624.47  78.69   395.88

We see that the pandas data frame df2 now correctly contains the data.
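The use of the encoding= parameter can be tried without the book's data files. The sketch below is an illustration under assumptions: it writes a shortened copy of the CSV text to a temporary file encoded as UTF-16BE (the temporary directory and the three-row excerpt are stand-ins for the real ind2016_16.csv) and then reads it back with the matching encoding.

```python
import os
import tempfile

import pandas as pd

# Shortened stand-in for the contents of ind2016_16.csv.
csv_text = ("code,country,pop,gdp,life,cell\n"
            "CAN,Canada,36.26,1535.77,82.3,30.75\n"
            "CHN,China,1378.66,11199.15,76.25,1364.93\n"
            "IND,India,1324.17,2263.79,68.56,1127.81\n")

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "ind2016_16.csv")
    # Write the characters using UTF-16BE, two bytes per character here.
    with open(path, "w", encoding="UTF-16BE", newline="") as f:
        f.write(csv_text)
    # Tell read_csv( ) how the bytes of the file map back to characters.
    df2 = pd.read_csv(path, encoding="UTF-16BE")

print(df2.head())
```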

21.2.2 CSV from Network Data

It is common to receive CSV files over a network, e.g., when sent as an email attachment, or hosted on a website. We have hosted the data sets mentioned above on the book website and now demonstrate how to retrieve them. We begin with ind2016.csv:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figi_HTML.png

This file was encoded as UTF-8. We see below that the encoding assumed by the requests module is ISO-8859-1.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figj_HTML.png

| 'ISO-8859-1'

This assumed encoding means that it is entirely possible to fail at reading a CSV file into pandas when it is received over the network. Care is required. If we look at response.headers['Content-Type'], we get 'text/csv' and this does not, in this case, give more specific information on the encoding.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figk_HTML.png

| 'text/csv'

The characters actually used in this file are all in the range (code points 0 to 127) supported by the ASCII encoding, each of which fits in one byte. When this is the case, ASCII, UTF-8, and ISO-8859-1 encodings result in the same bytes for this sequence of characters, so the assumed encoding happens to decode the data correctly. If we know (or can find out) the encoding, the better thing to do is to set the encoding to the correct one and then access the “string” version:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figl_HTML.png

| code,country,pop,gdp,life,cell
| CAN,Canada,36.26,1535.77,82.3,30.75
| CHN,China,1378.66,11199.15,76.25,1364.93
| IND,India,1324.17,2263.79,68.56,1127.81
| RUS,Russia,144.34,1283.16,71.59,229.13
| USA,United States,323.13,18624.47,78.69,395.88
| VNM,Vietnam,94.57,205.28,76.25,120.6

If the data presented above were in a file, instead of being in a memory structure of a Response object, we could use our file-based techniques from Sect. 3.4 to iterate over the lines and compose the data into a native Python data structure. Otherwise, we would have to manually extract the data. Further, the ability to layer a file-like view of a set of data where the bytes or characters reside in memory would allow pandas, lxml, and json to perform parsing and interpretation the same way they do for files.

Fortunately, the Python io module has facilities for exactly this purpose: taking a bytes object or a string held in memory and constructing a file-like object that presents the same interface and functionality as we get when we perform an open( ) on a file and obtain a file object. There are two variations, based on whether the in-memory structure is a str or a bytes object.

io.StringIO( ) : takes a string buffer and returns an object that operates in the same way as a file object returned from an open( ) call. Like a file object, this object has a notion of a current location that advances as we read (using read( ) , readline( ) , etc.) through the characters of the object.
io.BytesIO( ): takes a bytes buffer and returns an object that operates in the same way as a file object returned from an open( ) call and opened in binary mode. Like a file object, this object has a notion of a current location that advances as we perform read( ) operations over the bytes of the object.

We will use these constructors, passing either Response.text or Response.content, as appropriate, to allow much easier processing of HTTP response messages as we consider CSV as well as JSON and XML parsing and interpretation in the sections that follow.
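The behavior of the two constructors can be seen with a small in-memory example. This is a minimal sketch; the short text value and the variable names are assumptions standing in for Response.text and Response.content.

```python
import io

text = "code,country\nCAN,Canada\nCHN,China\n"

# String buffer: behaves like a file opened in text mode.
sfile = io.StringIO(text)
header = sfile.readline()      # reads through the first newline
rest = sfile.read()            # continues from the current location

# Bytes buffer: behaves like a file opened in binary mode.
bfile = io.BytesIO(text.encode("UTF-8"))
first = bfile.read(4)          # reads exactly four bytes
position = bfile.tell()        # current location has advanced to 4

print(repr(header), repr(rest))
print(first, position)
```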

21.2.2.1 Option 1: From String Text

Suppose response.text contains the data of ind2016.csv, as text data, obtained from requests.get( ) as above. We can use io.StringIO( ) to create a file-like object and explicitly process it into a native Python data structure and then construct a pandas data frame.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figm_HTML.png

|    code        country      pop       gdp   life     cell
| 0   CAN         Canada    36.26   1535.77  82.30    30.75
| 1   CHN          China  1378.66  11199.15  76.25  1364.93
| 2   IND          India  1324.17   2263.79  68.56  1127.81
| 3   RUS         Russia   144.34   1283.16  71.59   229.13
| 4   USA  United States   323.13  18624.47  78.69   395.88
| 5   VNM        Vietnam    94.57    205.28  76.25   120.60
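One way to carry out this explicit processing is sketched below. It is an illustration, not the book's listing: the variable text is a shortened stand-in for response.text, and the names fileobj, header, and LoL are assumptions.

```python
import io

import pandas as pd

# Shortened stand-in for response.text.
text = ("code,country,pop,gdp,life,cell\n"
        "CAN,Canada,36.26,1535.77,82.3,30.75\n"
        "CHN,China,1378.66,11199.15,76.25,1364.93\n"
        "VNM,Vietnam,94.57,205.28,76.25,120.6\n")

fileobj = io.StringIO(text)

# The first line holds the column names.
header = fileobj.readline().strip().split(",")

# Each remaining line becomes one row in a list of lists.
LoL = [line.strip().split(",") for line in fileobj]

df = pd.DataFrame(LoL, columns=header)
# Every field arrived as a string, so convert the numeric columns.
df = df.astype({"pop": float, "gdp": float, "life": float, "cell": float})
print(df)
```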

This process can be streamlined by use of pandas built-in functions. In the read_csv( ) data frame constructor, the first argument can be a file object or a file-like object. So we can create the file-like object from the string version of the response and use that as the first argument, with the rest of the benefit in parameter options that come from using read_csv( ) :

../images/479588_1_En_21_Chapter/479588_1_En_21_Fign_HTML.png

|             country      pop       gdp   life     cell
| code
| CAN          Canada    36.26   1535.77  82.30    30.75
| CHN           China  1378.66  11199.15  76.25  1364.93
| IND           India  1324.17   2263.79  68.56  1127.81
| RUS          Russia   144.34   1283.16  71.59   229.13
| USA   United States   323.13  18624.47  78.69   395.88
| VNM         Vietnam    94.57    205.28  76.25   120.60
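The streamlined version might look like the following sketch, in which text is again a shortened stand-in for response.text. The use of index_col="code" is inferred from the output shown above, where code appears as the index.

```python
import io

import pandas as pd

# Shortened stand-in for response.text.
text = ("code,country,pop,gdp,life,cell\n"
        "CAN,Canada,36.26,1535.77,82.3,30.75\n"
        "CHN,China,1378.66,11199.15,76.25,1364.93\n")

# read_csv( ) accepts a file-like object wherever it accepts a file.
df = pd.read_csv(io.StringIO(text), index_col="code")
print(df)
```

Here read_csv( ) also infers numeric types for the columns, so no separate conversion step is needed.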

Now let us turn to a case where the encoding is UTF-16BE. Again, the file itself is still a text file.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figo_HTML.png

The object response is very similar to the version of response associated with ind2016.csv. For a web server and the HTTP request, there is little difference between one file and another. So we would not expect the assumed encoding to be correct, and indeed it is not:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figp_HTML.png

| 'ISO-8859-1'

If we were to look at the decoded version through response.text, we see a nonsense string, exactly because the decoding was incorrect.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figq_HTML.png

| '\x00c\x00o\x00d\x00e\x00,\x00c\x00o\x00u\x00n\x00t'

To fix this, we set the encoding to the proper value, given our knowledge of how this particular resource was encoded, and we then see an appropriate response.text:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figr_HTML.png

| code,country,pop,gdp,life,cell
| CAN,Canada,36.26,1535.77,82.3,30.75
| CHN,China,1378.66,11199.15,76.25,1364.93
| IND,India,1324.17,2263.79,68.56,1127.81
| RUS,Russia,144.34,1283.16,71.59,229.13
| USA,United States,323.13,18624.47,78.69,395.88
| VNM,Vietnam,94.57,205.28,76.25,120.6
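The effect of the assumed versus the true encoding can be reproduced without a network connection, since the text view of a response is, in essence, its bytes decoded with the encoding in effect. In this sketch, content is an assumed stand-in for response.content, and the two decode calls mimic the text obtained before and after the encoding is corrected.

```python
# Stand-in for response.content: CSV text encoded as UTF-16BE.
content = "code,country,pop\nCAN,Canada,36.26\n".encode("UTF-16BE")

# Decoding with the assumed encoding interleaves a NUL character
# before every real character.
text_assumed = content.decode("ISO-8859-1")

# Decoding with the true encoding recovers the original characters.
text_actual = content.decode("UTF-16BE")

print(repr(text_assumed[:20]))
print(text_actual)
```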

If response.encoding is correct, then response.text will be a correct string containing the textual CSV data. At this point, the same technique applies: we create a file-like object from the response.text string and then proceed with pandas as we did in Chap. 6:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figs_HTML.png

|             country      pop       gdp   life     cell
| code
| CAN          Canada    36.26   1535.77  82.30    30.75
| CHN           China  1378.66  11199.15  76.25  1364.93
| IND           India  1324.17   2263.79  68.56  1127.81
| RUS          Russia   144.34   1283.16  71.59   229.13
| USA   United States   323.13  18624.47  78.69   395.88
| VNM         Vietnam    94.57    205.28  76.25   120.60

We have seen how to read string text into a pandas data frame. We turn now to the case where we use the body from requests.get( ) as byte data.

21.2.2.2 Option 2: From Underlying Bytes

In the example above, changes in response.encoding and the resultant difference in response.text did not change the underlying bytes data, available in response.content. While it is more complex, particularly across non-standard encoding, to use the bytes data and direct file type operations to construct a data frame, the pandas read_csv( ) can take its input from a file-like object containing bytes data and can perform the decoding itself.

To demonstrate this across our two different encodings, we GET both the UTF-8 encoded CSV file and the UTF-16BE encoded CSV file and use different response objects for the two results:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figt_HTML.png

When we are dealing with the underlying bytes data, and we want/need a file-like object, we use io.BytesIO( ) to construct the file-like object from the bytes in response1.content and response2.content. We then pass the file-like objects to read_csv( ) and specify the proper encoding:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figu_HTML.png

We see that this procedure succeeds, demonstrating the power of read_csv:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figv_HTML.png

|    code        country      pop       gdp   life     cell
| 0   CAN         Canada    36.26   1535.77  82.30    30.75
| 1   CHN          China  1378.66  11199.15  76.25  1364.93
| 2   IND          India  1324.17   2263.79  68.56  1127.81
| 3   RUS         Russia   144.34   1283.16  71.59   229.13
| 4   USA  United States   323.13  18624.47  78.69   395.88
| 5   VNM        Vietnam    94.57    205.28  76.25   120.60
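The bytes-based approach can be sketched as follows. The values content1 and content2 are assumed stand-ins for response1.content and response2.content, built here by encoding the same shortened CSV text in the two ways.

```python
import io

import pandas as pd

csv_text = ("code,country,pop,gdp,life,cell\n"
            "CAN,Canada,36.26,1535.77,82.3,30.75\n"
            "VNM,Vietnam,94.57,205.28,76.25,120.6\n")

content1 = csv_text.encode("UTF-8")      # stand-in for response1.content
content2 = csv_text.encode("UTF-16BE")   # stand-in for response2.content

# read_csv( ) decodes the bytes itself when given the proper encoding.
df1 = pd.read_csv(io.BytesIO(content1), encoding="UTF-8")
df2 = pd.read_csv(io.BytesIO(content2), encoding="UTF-16BE")

print(df1)
print(df1.equals(df2))
```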

The value of df2 is identical. Having demonstrated how to read CSV data, we turn now to reading JSON data.

21.2.3 Reading Questions

21.16

The first example involves specifying the correct encoding for ind2016_16.csv. The reading shows what happens if you specify no encoding. What do the results look like if you specify the wrong encoding? Investigate (using the local file ind2016_16.csv that you should have downloaded) with at least three encodings.

21.17

Please carry out the requests.get( csv_url) block of code and experiment with setting different encodings. Try with both naive encodings like ASCII and also more advanced encodings like UTF-16. Also try with a bytes encoding like UTF-16BE. Describe the results of util.print_text( ) in each case.

21.18

Recall that when you open( ) a file you can choose various modes, e.g., for reading versus writing. What mode would you use to create a bytes file?

21.19

In the code to process response.text into a LoL, please explain the purpose of strip( ) , split( ) , and astype( ) . You might want to refer back to earlier chapters.

21.20

Investigate the read_csv( ) method of reading response.text into a data frame. Are the entries floating point numbers or strings?

21.21

In the approach to reading from response.content into a data frame, did we need to set a value for response.encoding? Why or why not?

21.2.4 Exercises

21.22

The purpose of io.StringIO( ) is to create a file-like object from any string in a Python program. The object created “acts” just like an open file would.

Consider the following single Python string, s, composed over multiple continued lines:

s = "Twilight and evening bell,\n" \
    "And after that the dark!\n" \
    "And may there be no sadness of farewell,\n" \
    "When I embark;\n"

First, write some code to deal with s as a string:

determine the length of s,
find the start and end indices of the substring "dark" within s,
create string s2 by replacing "embark" with "disembark".

Now, create a file-like object from s and perform a first readline( ) , assigning to variable line1 and then write a for loop to use the file-like object as an iterator to accumulate into lines a list of the remaining lines, printing each.

21.23

Practice with io.StringIO( ) by using a for loop to print the numbers 1 through 100 into a file-like object, one per line. Then, iterate through this object and confirm that read( ) yields a string representing the entire data, while readline( ) yields one line at a time, and keeps track of the location in the file-like object. Provide your code.

21.24

Repeat the previous problem but with io.BytesIO( ) . Note that you can convert the numbers yielded by your loop into bytes using the bytes( ) function from the previous section. Provide your code.

The next set of exercises involve a file at resource path /data/mystery3.dat on host datasystems.denison.edu. You can assume the file is textual and is a tab-separated data collection where each line consists of:

male_name <tab> male_count <tab> female_name <tab> female_count

for the top 10 name applications of each sex to the US Social Security Administration for the year 2015.

21.25

Suppose the encoding of the file is unknown but will be from one of the following:

“UTF-8,”
“UTF-16BE,”
“UTF-16LE,”
“cp037,”
“latin_1.”

Write code to:

acquire the file from the web server,
ensure the status_code is 200,
assign to content_type the value of the Content-Type header line of the response,
determine the correct encoding and assign to real_encoding,
set the .encoding attribute of the response to real_encoding,
assign to csv_body the string text for the body of the response.

21.26

In this question, you will start with a string and create a dictionary of lists representation of the data it contains. We suggest using the result of the previous problem, csv_body, as the starting point. To start independently, you can instead use the following string literal assignment to reach the same starting point:

csv_body = "Noah\t19635\tEmma\t20455\n" \
           "Liam\t18374\tOlivia\t19691\n" \
           "Mason\t16627\tSophia\t17417\n" \
           "Jacob\t15949\tAva\t16378\n" \
           "William\t15909\tIsabella\t15617\n" \
           "Ethan\t15077\tMia\t14905\n" \
           "James\t14824\tAbigail\t12401\n" \
           "Alexander\t14547\tEmily\t11786\n" \
           "Michael\t14431\tCharlotte\t11398\n" \
           "Benjamin\t13700\tHarper\t10295\n"

Construct a file-like object from csv_body and then use file object operations to create a dictionary of lists representation of the tab-separated data. Note that there is no header line in the data, so you can name the columns malename, malecount, femalename, and femalecount.

21.27

Use pandas to obtain a data frame named df by passing a file-like object based on csv_body to read_csv( ). Make sure you have reasonable column names.

Be careful to call read_csv so that the separators are tabs, not commas.

21.3 JSON Data

Recall from Chaps. 2 and 15 that JSON is a light-weight format for transmitting simple data types.

21.3.1 JSON from File

When we acquire JSON through a file, we use the json.load( ) function, which takes an open file object (or file-like object) as its argument. Therefore, to deal with a different encoding, we simply need to specify the encoding as we open the file.

We start by showing what happens when we fail to do this after acquiring data encoded in a non-default UTF-16BE:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figw_HTML.png

../images/479588_1_En_21_Chapter/479588_1_En_21_Figx_HTML.png

| Exception encountered in JSON decode

The load( ) operation raised an exception since it was unable to decode the data. Now we create a file object and specify the correct encoding and voila, things work as we need them to:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figy_HTML.png

../images/479588_1_En_21_Chapter/479588_1_En_21_Figz_HTML.png

| {

| "FRA": {

| "2007": {

| "pop": 64.02,

| "gdp": 2657.21

| },

| "2017": {

| "pop": 66.87,

| "gdp": 2586.29

| }
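The failure and the fix can be reproduced without the book's data file. The following is a minimal sketch, not the book's listing: it assumes a throwaway temporary file (the name ind0_16.json is borrowed from the reading questions, and the contents are the two French data points shown above), writes it as UTF-16BE, and then loads it twice. Because this sample is pure ASCII, reading it as UTF-8 does not fail while decoding; the interleaved zero bytes become NUL characters, and it is the JSON parser that rejects them.

```python
import json
import os
import tempfile

# Sample data: the two French data points from the output above.
indicators = {
    "FRA": {
        "2007": {"pop": 64.02, "gdp": 2657.21},
        "2017": {"pop": 66.87, "gdp": 2586.29},
    }
}

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical local file, written as UTF-16BE (no byte order mark).
    path = os.path.join(tmpdir, "ind0_16.json")
    with open(path, "w", encoding="UTF-16BE") as f:
        json.dump(indicators, f, indent=2)

    # Wrong encoding given to open(): the load fails.
    try:
        with open(path, "r", encoding="UTF-8") as f:
            json.load(f)
        wrong_encoding_failed = False
    except ValueError as err:
        # JSONDecodeError and UnicodeDecodeError are both ValueErrors.
        wrong_encoding_failed = True
        print("Exception encountered in JSON decode:", type(err).__name__)

    # Correct encoding given to open(): the load succeeds.
    with open(path, "r", encoding="UTF-16BE") as f:
        data = json.load(f)

print(json.dumps(data, indent=2))
```

Catching ValueError covers both ways the wrong encoding can surface: as a decoding error when the bytes are not valid in the assumed encoding, or as a JSON parse error when they happen to decode into nonsense.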

21.3.2 JSON from Network

In the following examples, we obtain from the web server files with JSON as the body data. In response1, we have UTF-8 encoded data. In response2, we have UTF-16BE encoded data.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figaa_HTML.png

We next show how to get from response1 and response2 to in-memory data structures.

21.3.2.1 JSON from String Data in Response

In common with the examples above, when we want to use the .text (string) version of the response, we must get the encoding right. We do this for both response1 and response2, at which point the character string version of the two responses is valid, and we can use a variety of techniques to go from a string into a JSON-based data structure. Since the latter steps are the same after we get the encoding right, we just run through examples using response1.text.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figab_HTML.png

| {

| "FRA": {

| "2007": {

| "pop": 64.02,

| "gdp": 2657.21

| },

| "2017": {

| "pop": 66.87,

| "gdp": 2586.29

| }

../images/479588_1_En_21_Chapter/479588_1_En_21_Figac_HTML.png

| {

| "FRA": {

| "2007": {

| "pop": 64.02,

| "gdp": 2657.21

| },

| "2017": {

| "pop": 66.87,

| "gdp": 2586.29

| }

We see that as soon as the correct encoding is specified, the field response.text is legible. We are ready to read the data into memory.
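What "getting the encoding right" means can be seen without a network connection. As far as we are aware, the .text attribute of a requests response is the body bytes decoded with the codec named in .encoding, with undecodable bytes replaced rather than raising an error. The sketch below imitates that step on plain bytes; body is a stand-in for response2.content and is not data fetched from the book's server.

```python
# Stand-in for response2.content: JSON text encoded as UTF-16BE.
original = '{"FRA": {"2007": {"pop": 64.02}}}'
body = original.encode("UTF-16BE")

# Decoding with the wrong codec: each zero byte becomes a NUL character.
wrong_text = str(body, "UTF-8", errors="replace")

# Decoding with the correct codec recovers the original characters.
right_text = str(body, "UTF-16BE", errors="replace")

print(repr(wrong_text[:8]))
print(right_text)
```

The bytes never change between the two calls; only the interpretation does, which is why setting .encoding before reading .text is enough.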

Option 1

Use json.loads( ) , which takes a string and returns the in-memory data structure.

Given a JSON-formatted string s, the function loads(s), in Python's standard json module, returns the data structure that s encodes. For instance, if s represents a JSON array, then a Python list is returned, and if s represents a JSON object, then a Python dictionary is returned. We demonstrate:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figad_HTML.png

| {

| "FRA": {

| "2007": {

| "pop": 64.02,

| "gdp": 2657.21

| },

| "2017": {

| "pop": 66.87,

| "gdp": 2586.29

| }

We achieve a Python dictionary in memory.
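As a small self-contained sketch of the array and object cases, the strings below are hand-written stand-ins for what response1.text might hold; they are illustrative and are not the book's data.

```python
import json

# Stand-ins for the string found in response1.text.
object_text = '{"FRA": {"2007": {"pop": 64.02, "gdp": 2657.21}}}'
array_text = '[64.02, 66.87]'

as_dict = json.loads(object_text)  # JSON object -> Python dict
as_list = json.loads(array_text)   # JSON array  -> Python list

print(type(as_dict).__name__, as_dict["FRA"]["2007"]["pop"])
print(type(as_list).__name__, as_list)
```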

Option 2

Create a file-like object, and then use json.load( ) .

Given a JSON file, the load( ) function returns the data structure encoded. Just as we did with CSV files, we can use StringIO to produce a file-like object, which we can feed to json.load( ) as follows:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figae_HTML.png

| {

| "FRA": {

| "2007": {

| "pop": 64.02,

| "gdp": 2657.21

| },

| "2017": {

| "pop": 66.87,

| "gdp": 2586.29

| }

We achieve a Python dictionary in memory.
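The file-like route can be sketched in a few lines. Here text is an assumed stand-in for an already correctly decoded response1.text.

```python
import io
import json

# Stand-in for response1.text (already decoded into a Python string).
text = '{"FRA": {"2017": {"pop": 66.87, "gdp": 2586.29}}}'

# Wrap the string so that it behaves like an open text file.
with io.StringIO(text) as fileobj:
    data = json.load(fileobj)

print(data["FRA"]["2017"])
```

Since json.load( ) reads the whole file-like object before parsing, this gives the same result as json.loads( ) on the string; the file-like form is mainly useful when other code expects a file.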

Option 3

Use the .json( ) method of a requests response object.

Lastly, the requests module has built-in functionality for JSON files, because of their ubiquity. The following shows how to read directly into a Python dictionary from the HTTP response received.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figaf_HTML.png

| {

| "FRA": {

| "2007": {

| "pop": 64.02,

| "gdp": 2657.21

| },

| "2017": {

| "pop": 66.87,

| "gdp": 2586.29

| }

In all three of these examples, we have been provided JSON data in string form. We consider now the case of JSON data in byte form.

21.3.2.2 JSON from Bytes Data in Response Body

Because its UTF-16BE encoding yields a different sequence of bytes for the same characters, we use the bytes data of response2 in the examples that follow, which convert bytes data into a JSON-derived data structure.

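A Request for Comments (RFC) documents a given Internet Standard. Earlier versions of the RFC for JSON explicitly allow all three of UTF-8, UTF-16, and UTF-32 to be used for data formatted as JSON (the current version, RFC 8259, requires UTF-8 for data exchanged between systems that are not part of a closed ecosystem). Python's json module accordingly detects which of these three encodings a bytes object uses and decodes it for us, so we can pass the bytes data directly, as if it were already a decoded string, greatly simplifying our lives.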

Option 1

Use json.loads( ) , which takes bytes data in UTF-8, UTF-16, or UTF-32 and returns the in-memory data structure.

Analogous to the situation of text data, we can feed the json.loads( ) function byte data rather than text data:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figag_HTML.png

| {

| "FRA": {

| "2007": {

| "pop": 64.02,

| "gdp": 2657.21

The result is still a Python data structure in memory.
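The following sketch illustrates that no encoding needs to be named when bytes are passed to json.loads( ). It assumes Python 3.6 or later, where json.loads( ) accepts bytes; the raw values are locally built stand-ins for response.content, not data from the book's server.

```python
import json

text = '{"FRA": {"2007": {"pop": 64.02, "gdp": 2657.21}}}'
expected = json.loads(text)

results = {}
for encoding in ("UTF-8", "UTF-16BE", "UTF-32LE"):
    raw = text.encode(encoding)          # stand-in for response.content
    results[encoding] = json.loads(raw)  # no decode step, no encoding argument

for encoding, value in results.items():
    print(encoding, value == expected)
```

The detection covers only the UTF-8, UTF-16, and UTF-32 families. Bytes in some other encoding, such as cp037, would still need to be decoded explicitly first.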

Option 2

Create a bytes file-like object and then use json.load( ) .

Similarly, we can feed the json.load( ) function a byte file instead of a text file. We use BytesIO to get from the byte data response to a file-like object.
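A minimal sketch of this step, with raw as an assumed stand-in for response2.content:

```python
import io
import json

# Stand-in for response2.content: JSON encoded as UTF-16BE bytes.
raw = '{"FRA": {"2017": {"pop": 66.87, "gdp": 2586.29}}}'.encode("UTF-16BE")

# Wrap the bytes so that they behave like a file opened in binary mode.
with io.BytesIO(raw) as fileobj:
    data = json.load(fileobj)

print(data["FRA"]["2017"]["pop"])
```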

../images/479588_1_En_21_Chapter/479588_1_En_21_Figah_HTML.png

| {

| "FRA": {

| "2007": {

| "pop": 64.02,

| "gdp": 2657.21

The result is a Python data structure in memory. Having demonstrated how to acquire JSON data, we turn to XML data.

21.3.3 Reading Questions

21.28

Please download ind0_16.json as a local file and experiment with setting different encodings in open( ), using the try/except block of code given. Explain what happens.

21.29

In the previous question, we explored errors associated with the JSON load( ) function when the encoding is wrong. Please do the same now with the loads( ) function and describe what happens when the encoding is wrong. You may use the code provided to read JSON data from the book web page.

21.30

In JSON Option 2 for response.text, does this assume the encoding has already been specified?

21.31

In JSON Option 3 for response.text, does this assume the encoding has already been specified?

21.32

When extracting JSON from bytes data, do we need to specify response.encoding before applying json.loads( ) to response.content?

21.33

Why were there three options for extracting in-memory data structures from string data but only two options for bytes data?

21.3.4 Exercises

In many of the following exercises, we will show a curl incantation that obtains JSON-formatted text data from the Internet. Your task will be to translate the incantation into the equivalent requests module programming steps, and to obtain the parsed JSON-based data structure from the result, assigning to variable data. In some cases, we will ask for a specific method from those demonstrated in the section.

21.34

Using any method, get the JSON data from school0.json:

curl -s -o school0.json \
     https://datasystems.denison.edu/data/school0.json

21.35

Using the bytes data in .content, a file-like object, and json.load( ) , get the JSON data from school0.json.

curl -s -o school0.json \
     https://datasystems.denison.edu/data/school0.json

21.36

Write a function

getJSONdata(resource, location, protocol='http')

that makes a request to location for resource with the specified protocol, then uses the bytes data in the .content of the response, with a file-like object, and json.load( ) , to get the JSON data. On success, return the data. On failure of either the request or the parse of the data, return None.

21.37

The school0_32.json resource is encoded with utf-32. Use the method of setting the .encoding attribute and then accessing the .text string body, and get the JSON data.

curl -s -o school0_32.json \
     https://datasystems.denison.edu/data/school0_32.json

21.38

Repeat acquiring the school0_32.json resource, encoded with utf-32. This time, use the method of using the bytes data in .content, a file-like object, and json.load( ) .

curl -s -o school0_32.json \
     https://datasystems.denison.edu/data/school0_32.json

Where in your code did the encoding of “utf-32” come into play? Can you explain why? What does this mean for the getJSONdata( ) function you wrote previously?

21.39

Use any method you wish to obtain the JSON data associated with the following POST request. Make sure you faithfully translate the -H and -d options of the curl into their requests equivalent.

curl -X POST -s -o data/reply.json -d field1='value1' \
     -d field2=42 -H "Accept: application/json" \
     "https://httpbin.org/post"

21.4 XML Data

Recall from Chap. 15 that XML is a format used for hierarchical data. When an XML file is well formed, we can parse it to map it onto the tree it represents, and can extract the root Element containing the data of the entire tree. This process of turning an XML file into a tree, and finding the root, uses the lxml library and the etree module within it.

In the examples that follow, when we have parsed a tree, we print out the tag of the root Element. Since, for all these examples, the data is the indicators data set, and the tree is structured so that the root Element has the tag indicators, we are successful when this is the result we print. Also, the parse( ) function raises an exception when it encounters a problem, so we place our examples in try-except blocks to help show when such problems occur.

21.4.1 XML from File Data

When we have XML in a file, we have two options for opening and parsing. We can specify the file name, or we can specify a file object. For the former, we first build the relevant path.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figai_HTML.png

We demonstrate specifying a path to the parse( ) function:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figaj_HTML.png

| indicators

Next we demonstrate specifying a file object:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figak_HTML.png

| indicators

This can be done for files of any encoding. Again, we first demonstrate specifying a path:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figal_HTML.png

| indicators

Importantly, the parse function is intelligent enough to figure out the encoding, even if we do not specify it when we open the file. Hence, the parse function is actually doing the decoding in the following block of code:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figam_HTML.png

| indicators
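The two ways of calling parse( ) can be sketched on throwaway files. This is an illustration under stated assumptions rather than the book's listing: the standard library's xml.etree.ElementTree is used as a stand-in for lxml's etree so that the sketch has no third-party dependency (parse( ) and getroot( ) have the same names in both), the file names are hypothetical, and the XML is a cut-down version of the indicators data. The file object is opened in binary mode so that it is the parser, guided by the XML declaration, that does the decoding.

```python
import os
import tempfile
import xml.etree.ElementTree as etree  # stand-in for lxml's etree

xml_text = (
    "<?xml version='1.0' encoding='{enc}'?>\n"
    "<indicators>\n"
    '  <country code="FRA" name="France">\n'
    '    <timedata year="2007"><pop>64.02</pop><gdp>2657.21</gdp></timedata>\n'
    "  </country>\n"
    "</indicators>\n"
)

tags = []
with tempfile.TemporaryDirectory() as tmpdir:
    for enc in ("UTF-8", "UTF-16BE"):
        # Hypothetical file, written in the encoding its declaration names.
        path = os.path.join(tmpdir, "ind0_" + enc + ".xml")
        with open(path, "wb") as f:
            f.write(xml_text.format(enc=enc).encode(enc))

        # 1. Specify a path.
        tree = etree.parse(path)
        tags.append(tree.getroot().tag)

        # 2. Specify a file object (binary mode, no encoding given).
        with open(path, "rb") as f:
            tree = etree.parse(f)
        tags.append(tree.getroot().tag)

print(tags)
```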

Having reviewed how to open and parse XML files locally, we turn to data obtained over the network.

21.4.2 From Network

In the following examples, we obtain from the web server files with XML as the body data. In response1, we have UTF-8 encoded data, and in response2, we have UTF-16BE encoded data.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figan_HTML.png

In both cases, the headers of the response indicate that the body is in XML format, but do not specify the encoding. The default encoding for a text/xml content type is “UTF-8,” as we can see within the text field of the response:

../images/479588_1_En_21_Chapter/479588_1_En_21_Figao_HTML.png

| text/xml

../images/479588_1_En_21_Chapter/479588_1_En_21_Figap_HTML.png

| <?xml version='1.0' encoding='UTF-8'?>

| <indicators>

| <country code="FRA" name="France">

| <timedata year="2007">

| <pop>64.02</pop>

| <gdp>2657.21</gdp>

| </timedata>

| <timedata year="2017">

| <pop>66.87</pop>

| <gdp>2586.29</gdp>

For the second encoding type, we must specify the encoding before attempting to retrieve the text of the response. Otherwise, it will not render correctly.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figaq_HTML.png

| text/xml

../images/479588_1_En_21_Chapter/479588_1_En_21_Figar_HTML.png

| <?xml version='1.0' encoding='utf-16be' standalone='yes'?>

| <indicators>

| <country code="FRA" name="France">

| <timedata year="2007">

| <pop>64.02</pop>

| <gdp>2657.21</gdp>

| </timedata>

| <timedata year="2017">

| <pop>66.87</pop>

| <gdp>2586.29</gdp>

In both cases, we have achieved XML data in the text field of response. We now discuss how to parse this XML data.

21.4.2.1 Using parse on Bytes

We consider first the case where the data comes to us in byte form. As usual, we use io.BytesIO( ) to create a file-like object. The parse( ) function is intelligent enough to parse such a file, resulting in a properly formed tree.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figas_HTML.png

Even when the data is encoded in a way other than UTF-8, we no longer need to specify the encoding, because the parse( ) function is intelligent enough to decode on its own, as we explained above.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figat_HTML.png

We demonstrate that the two trees returned by parse are indeed as expected, i.e., that the root is indicators as it should be. This means that the root Element will contain all data stored in the XML tree.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figau_HTML.png

| 'indicators'

../images/479588_1_En_21_Chapter/479588_1_En_21_Figav_HTML.png

| 'indicators'
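The same behavior can be sketched offline. In the block below, content1 and content2 are locally built stand-ins for response1.content and response2.content, make_body is a hypothetical helper introduced only for this illustration, and the standard library's xml.etree.ElementTree again stands in for lxml's etree.

```python
import io
import xml.etree.ElementTree as etree  # stand-in for lxml's etree


def make_body(encoding):
    """Build XML bytes whose declaration names the encoding actually used."""
    text = (
        "<?xml version='1.0' encoding='" + encoding + "'?>\n"
        '<indicators><country code="FRA" name="France"/></indicators>'
    )
    return text.encode(encoding)


content1 = make_body("UTF-8")     # stand-in for response1.content
content2 = make_body("UTF-16BE")  # stand-in for response2.content

# No encoding is passed anywhere: the parser decodes the bytes itself.
tree1 = etree.parse(io.BytesIO(content1))
tree2 = etree.parse(io.BytesIO(content2))

print(tree1.getroot().tag)
print(tree2.getroot().tag)
```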

We turn now to an alternative way to get the root Element that avoids the need for the parse( ) function.

21.4.2.2 Using fromstring( ) with Bytes and Strings

Given a response containing XML data in byte form, we can use the fromstring( ) function of the etree module to extract the root Element of the tree represented by the byte data.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figaw_HTML.png

| indicators

As with the parse( ) function, we do not need to specify an encoding for the fromstring( ) function, as the following code demonstrates.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figax_HTML.png

| indicators

We turn now to the situation of a response in text form, rather than binary form. In this case, we must read past the head matter of the response.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figay_HTML.png

We note here that the header we are skipping is the prolog at the start of the text of the XML data and is unrelated to the HTTP headers made available by the requests module. The header we are skipping is entirely contained in the body of the response we begin with.

We apply this function to both of our text-based XML data responses, to retrieve proper XML strings.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figaz_HTML.png

We can feed these XML strings into the fromstring( ) function, in much the way we fed it byte data, and again the function will give us the root of the XML tree.

../images/479588_1_En_21_Chapter/479588_1_En_21_Figba_HTML.png

| indicators

../images/479588_1_En_21_Chapter/479588_1_En_21_Figbb_HTML.png

| indicators
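The bytes and string cases can be contrasted in one short sketch. Several things here are assumptions: skip_declaration is a hypothetical helper written for this illustration and is not the skipheader lambda from the listing above, the XML is a cut-down stand-in for the response body, and the standard library's xml.etree.ElementTree stands in for lxml's etree. The stand-in differs from lxml in one relevant way: it tolerates a declaration at the start of a Python string, whereas lxml's fromstring( ) raises a ValueError for a string that begins with an XML declaration naming an encoding, which is why the declaration is removed first.

```python
import xml.etree.ElementTree as etree  # stand-in for lxml's etree


def skip_declaration(text):
    """Return text without a leading <?xml ...?> declaration, if present."""
    stripped = text.lstrip()
    if stripped.startswith("<?xml"):
        end = stripped.index("?>") + len("?>")
        return stripped[end:].lstrip()
    return stripped


xml_text = (
    "<?xml version='1.0' encoding='utf-16be' standalone='yes'?>\n"
    '<indicators><country code="FRA" name="France"/></indicators>'
)

# Bytes input: the parser reads the declaration and decodes on its own.
root_from_bytes = etree.fromstring(xml_text.encode("UTF-16BE"))

# String input: the text is already decoded, so drop the declaration first.
body = skip_declaration(xml_text)
root_from_text = etree.fromstring(body)

print(root_from_bytes.tag, root_from_text.tag)
```

Searching for the closing ?> rather than the first newline is a design choice in this sketch: it also works when the declaration and the root element share a line.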

In all cases, we are able to retrieve the tree and root Element representing the XML data that comes to us either in local form or over the network.

21.4.3 Reading Questions

21.40

The first XML example does not specify an encoding but the second does. Could the second have gotten away without specifying encoding='UTF-8'? Justify your answer by actually running the code.

21.41

If the parse function is doing the decoding, why do we need to wrap our code in a try/except block? What could go wrong?

21.42

When reading an XML file, how can you print the encoding, to see “UTF-8”? Hint: think back to how we did it for CSV and JSON.

21.43

Please experiment with response.text by purposely specifying the wrong encoding and seeing what is printed by util.print_text( ) for one of the XML files accessed over the network. Describe your results.

21.44

The reading shows how to use parse on Bytes. Can you also use parse( ) on response.text? Do you need to specify the encoding?

21.45

Consider the second block of code that invokes the fromstring( ) method. How does this demonstrate that encoding need not be specified?

21.46

Experiment to find out what happens if you purposely set the wrong encoding, e.g., with response2.encoding = 'ASCII' before invoking the fromstring( ) method on response2.content. Report your findings.

21.47

Explain the lambda function skipheader in detail. Why does this skip the header? Why is skipping the header important?

21.48

When feeding the fromstring( ) function text data, do we need to specify the encoding, e.g., with response1.encoding = 'UTF-8' before invoking skipheader(response1.text)? Investigate by actually running the code, and report what you found.

21.4.4 Exercises

In many of the following exercises, we will show a curl incantation that obtains XML-formatted text data from the Internet. Your task will be to translate the incantation into the equivalent requests module programming steps and to obtain the parsed XML-based ElementTree structure from the result, assigning to variable root the root of the result. In some cases, we will ask for a specific method from those demonstrated in the section.

21.49

Using any method, get the XML data from school0.xml:

curl -s -o school0.xml \
     https://datasystems.denison.edu/data/school0.xml

21.50

Using the bytes data in .content, a file-like object, and etree.parse( ) , get the XML data from school0.xml.

curl -s -o school0.xml \
     https://datasystems.denison.edu/data/school0.xml

21.51

Write a function

getXMLdata(resource, location, protocol='http')

that makes a request to location for resource with the specified protocol, and then uses the bytes data in the .content of the response, with a file-like object, and etree.parse( ) , to get the XML data. On success, return the root of the tree. On failure of either the request or the parse of the data, return None.

21.52

The school0_32.xml resource is encoded with utf-32be. Use the method of setting the .encoding attribute of the response and then accessing the .text string body, and using fromstring( ) . Remember that fromstring( ) expects to start from an element, not from the header line, so you will need to skip the header to get the string to pass.

curl -s -o school0_32.xml \
     https://datasystems.denison.edu/data/school0_32.xml

21.53

Repeat acquiring the school0_32.xml resource, encoded with utf-32be. This time, use the method of using the bytes data in .content, a file-like object, and etree.parse( ) .

curl -s -o school0_32.xml \
     https://datasystems.denison.edu/data/school0_32.xml

Where in your code did the encoding of “utf-32be” come into play? Can you explain why? What does this mean for the getXMLdata( ) function you wrote previously?

21.54

Use any method you wish to obtain the XML data associated with the following GET request. Do not simply copy and paste the full URL. Translate the set of query parameters into a dictionary to be used in the requests.get( ) invocation.

curl -s -o kivaloans.xml \
     'http://api.kivaws.org/v1/loans/search.xml?sector=agriculture&status=fundraising'