© Springer Nature Switzerland AG 2020
T. Bressoud, D. White, Introduction to Data Systems, https://doi.org/10.1007/978-3-030-54371-6_22

22. Web Scraping

Thomas Bressoud1   and David White1
(1)
Mathematics and Computer Science, Denison University, Granville, OH, USA
 
Chapter Goals
Upon completion of this chapter, you should understand the following:
  • HTML as a tree and as an instance of a markup language, analogous to XML, but with structure, tags, and attributes intended for the rendering of web pages.

  • The particular data set structuring facilities in HTML:

    • the table HTML structure,

    • ordered and unordered lists (ol and ul) as HTML structures for organizing and displaying data.

Upon completion of this chapter, you should be able to do the following:
  • Acquire and parse HTML documents from web servers, using:
    • GET for a web page,

    • POST for custom access to a web page based on a form.

  • Use combinations of XPath and procedural traversal to construct usable data sets from the most common HTML structures used for rendering data:
    • HTML tables,

    • HTML lists.

The HyperText Markup Language (HTML) is the format used for web pages on the Internet. The format provides for the logical structure of information for navigation, formatting, incorporation of pictures and other media, and the structure of web pages, as well as for linking to other web pages and resources both local and remote relative to the web server providing the content. Web pages can display data in various forms, and while the intent of the web page may have been the presentation of data, it can also serve as a data source in the context of our Data Systems.

The term web scraping refers to the programmatic access of the HTML of web pages and the extraction of useful information contained therein, even if the intent was the presentation of information and not the provision of data. We should note, at the outset, that data incorporated into web pages is the intellectual property of some owner of the data, often affiliated with the web server provider. As such, there are ethical and legal considerations whenever we decide to acquire data in this manner. The owner of the data has the authority and right to determine the valid use of the data and its acquisition through these programmatic means. Just because we can acquire data through web scraping techniques does not mean that doing so is ethical or legal, based on the wishes of the owner of that data.

22.1 HTML Structure and Its Representation of Data Sets

Like XML, HTML can define a tree representing the information for a web page. Unlike XML, HTML is a specific grammar. This means that the tags and many of the allowed relationships between parents and children are well defined and have a semantic meaning associated with them.1 For instance, all HTML documents have a root node of html, and its children can only be head and body. Under head, one of the most common children is title, used in rendering to show the title of the web page, often in a browser tab. Often the children of head include nodes for metadata and scripts, which are code segments to be run within the context of the web browser when the page is loaded. The children of head can also define link elements to tie the page to specific formatting styles.

The body of an HTML document subdivides a web page using div nodes, which are structuring elements. These often have class and id attributes to distinguish their use. Web page generation programs rely on the structuring mechanism of div nodes to provide, within the child structure of each div, elements for navigation and for separating the objects developers place into a web page.

The body also often has nodes that represent various levels of headers (in nodes tagged h1 through h6), and these are used to give the rendered document its sections, subsections, etc. We can use the p node to define paragraphs within the page, and define figures and captions. Nodes tagged span, often with class attributes, are used for grouping underlying page elements and frequently impose consistent styling on the rendering of elements.

The text below shows a minimal, but complete, HTML document that is fully capable of being rendered by a web browser. It includes a number of the elements described above.

<!DOCTYPE html>
<html>
  <head>
    <title>Title of the Page</title>
  </head>
  <body>
    <h1>First Level Heading</h1>
    <p>Paragraph defined in body.</p>
  </body>
</html>

In its original definition, HTML was not as strict as XML and the associated set of rules governing well-formedness that we saw in Chap. 17. For instance, a paragraph element and its start tag, <p>, was not constrained to have an end tag </p>. There were many such instances that, taken together, mean that a single unambiguous tree could not always be parsed and constructed from an HTML source. Specific web browsers, like Chrome and Internet Explorer, would define default heuristics for processing and rendering such cases, and web developers adapted to the rendering choices of the target browsers. As HTML has evolved, newer versions, such as XHTML, have addressed these potential problems by defining a strict version of HTML that adheres to XML well-formedness and is, in fact, defined as an XML Application [73].

The reason for pointing out the failure of some HTML to follow a strict tree definition (in part, by omitting some end tags) is a practical one. In performing web scraping, we must take HTML that has been developed by external entities, that is hence out of our control, and that is perhaps not intended for client application processing, and be able to parse it into a tree structure as we did with XML. Depending on how “broken” the HTML may be, we have some programmatic solutions to build a best-effort tree from the HTML so that we can proceed with our web scraping task.

Figure 22.1 shows an HTML document structure, including annotation of line numbers. The full text of the HTML has been edited to allow us to more easily see the overall structure. The html tree starts on line 2 and ends on the line marked 120. The head subtree extends from line 3 to line 11, and we have edited out the contents of a script node within the head for clarity. Beneath the body, which starts at line 12, there are many div elements and significant nesting. Lines 18 through 27 contain the HTML for a sidebar used for navigation, with detail omitted at line 23. The main content of the page occurs at the div on line 38, which we have collapsed, again to illustrate the overall structure.
../images/479588_1_En_22_Chapter/479588_1_En_22_Fig1_HTML.png
Fig. 22.1

HTML document structure

The process of web scraping involves, first, discovery, wherein we use tools to acquire the HTML, ascertain its structure, and, in particular, determine how the desired data within the page is represented in the HTML; and, second, programmatic extraction, wherein we use our techniques from XML, including both XPath and procedural programming, to process the HTML tree and extract the data, transforming it into a usable result, such as one or more data frames.

The remaining subsections describe the two most common ways of representing data sets in an HTML document. Understanding some of the possible HTML representations of data on a web page allows us to proceed through our discovery phase of web scraping. The HTML elements representing a data set will be subordinate in the tree, at a specific location determined by the overall structure. For example, in Fig. 22.1, these data set-containing HTML elements would be present in the subtree rooted at the collapsed div at line 38.

22.1.1 HTML Tables

For representing a tabular data set within an HTML page, the most natural structure would be to use an HTML table element. This HTML structure can separate the header of the table (using thead) from the body of the table (using tbody), can define individual rows of the table (using tr, table row), and, for fields within the row, can define header cells (using th, table header) and data cells (using td, table data). The result could look something like the rendered page from a browser as shown in Fig. 22.2.
../images/479588_1_En_22_Chapter/479588_1_En_22_Fig2_HTML.png
Fig. 22.2

HTML table rendered by browser

Fortunately, whether a table is simple and unadorned, as illustrated above, or uses borders and shading and other more complex formatting, the same table HTML structure is used, the difference coming from style sheets,2 class attributes, and possible subordinate structure in the definition of table header (th) and table data (td) elements.

Figure 22.3 presents the HTML for the table rendered in Fig. 22.2. The table element has two children of thead and tbody. The child of thead consists of a single tr (table row) element, which has six table header (th) cell elements. The column names are in the text of each of these elements. The tbody has multiple table row elements, one for each data row of the table, and each row has a td, table data, element corresponding to the six fields of the table. For illustration, the first two data rows are shown and the remaining four are collapsed.
../images/479588_1_En_22_Chapter/479588_1_En_22_Fig3_HTML.png
Fig. 22.3

HTML for simple table
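In abbreviated textual form, the structure shown in Fig. 22.3 looks like the following sketch (class attributes omitted, values taken from Fig. 22.2, and all but the first data row elided):

<table>
  <thead>
    <tr>
      <th>code</th>
      <th>country</th>
      <th>pop</th>
      <th>gdp</th>
      <th>life</th>
      <th>cell</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>CAN</td>
      <td>Canada</td>
      <td>36.26</td>
      <td>1535.77</td>
      <td>82.3</td>
      <td>30.75</td>
    </tr>
    ...
  </tbody>
</table>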

22.1.2 HTML Lists

Another very common structure in HTML for representing collections of data is a list structure. In the list, each item could represent a row of data. Within the row, there could be a set of HTML structure elements representing the fields within the row. This is very common in practice. For example, when one searches for top song lists, a site may present the result in list form. The same is true of web pages that visibly look like ordered lists, e.g., displaying recipes (ordered by steps), tourist attractions (ordered by rating), video games (ordered by popularity), lists of publications, etc. An example is shown in Fig. 22.4.
../images/479588_1_En_22_Chapter/479588_1_En_22_Fig4_HTML.png
Fig. 22.4

HTML list structure displayed

In the discovery step of web scraping, examination of this page reveals that the set of publications is organized as an HTML ordered list, which uses the HTML structuring tag ol. The children of ol are line items, which use the tag li. Each of these li structures gives a single row of the desired data set. Within each are structuring elements for what we might consider the fields of the row. For instance, we might want fields for the name of the publication, the publisher, the year, the co-authors, and any associated links. A snippet of the HTML for Fig. 22.4, for the list of Books and Chapters, is displayed in Fig. 22.5. Clearly, regular expressions would be useful for extracting the data desired.
../images/479588_1_En_22_Chapter/479588_1_En_22_Fig5_HTML.png
Fig. 22.5

HTML list structure to scrape

In this particular example, the children of the li tags are a set of span elements. In practice, one often sees div elements as children of li elements as well. One could also imagine, for each li defining a row, a nested list structure with another ordered list (ol) or perhaps an unordered list (ul), e.g., if items in a recipe list had ordered sub-items labeled a, b, c, etc.
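A skeletal example of such a structure, with hypothetical class names and placeholder field values (the actual attributes appear in Fig. 22.5), might look like:

<ol>
  <li>
    <span class='title'>Name of Publication</span>
    <span class='publisher'>Publisher</span>
    <span class='year'>Year</span>
  </li>
  ...
</ol>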

As another example, consider the web page displayed in Fig. 22.6. In this case, investigation of the web page reveals a nested list structure. In particular, there is an unordered list, tagged as ul, whose list elements, li, have children that are themselves ul lists, with items for each of the years 2007 and 2017. Within these, there is a third level of unordered list ul elements, whose li are the individual indicators of pop and gdp.
../images/479588_1_En_22_Chapter/479588_1_En_22_Fig6_HTML.png
Fig. 22.6

HTML nested lists

If we focus on just the HTML for this nested list structure, we see this nesting in the HTML displayed in Fig. 22.7, which uses indentation and line number annotation to help us understand the structure. The “row” of FRA is an li that extends from line 2 through line 17. After the text of the li element, the next level ul is defined; the first child li extends from line 4 through 9 and defines the 2007 indicator values, with a subordinate ul whose items give individual indicators on lines 6 and 7. The GBR subtree is structured similarly, as is the USA subtree, which is collapsed in this display.
../images/479588_1_En_22_Chapter/479588_1_En_22_Fig7_HTML.png
Fig. 22.7

HTML nested list structure
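A skeleton of the nesting just described, with the indicator values elided, looks like:

<ul>
  <li>FRA
    <ul>
      <li>2007
        <ul>
          <li>pop ...</li>
          <li>gdp ...</li>
        </ul>
      </li>
      <li>2017
        ...
      </li>
    </ul>
  </li>
  <li>GBR
    ...
  </li>
  ...
</ul>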

HTML defines many more tags than we have discussed here. In Table 22.1, we present a list of some of the most common tags you are likely to encounter, along with a brief description of their use.
Table 22.1

Common HTML tags

Tag

Description

a

Defines a hyperlink, which is used to link from one page to another, or to a reference point within the current page. The href attribute defines the link itself, while the text gives the displayed version

body

Defines the body contents of an HTML document and is a child of the html root

div

Defines a division or container in an HTML document to provide structure, and may be nested to group children as needed. The id attribute is used to identify elements, and the class attribute is also used for categorizing containers and for common styling

form

Defines an HTML form for user input. Some possible children include label, select, and input

h1 to h6

Defines various levels of HTML headings

head

Child of the html root, and used to define the metadata for an HTML document before the definition of the body

html

Defines the root of an HTML tree

input

Defines an input field as part of a form, where the user can enter data

label

Defines a label within a form for, among others, input and select elements

li

Defines a list item, and is a child of ordered lists, ol, and unordered lists, ul

link

Relates the current document to an external resource, often for style information

meta

Defines specific metadata in the head of an HTML document

ol

Defines an ordered list structure. Children are li elements

option

Defines children of a select as dropdown items in the select list

p

Defines a paragraph

script

Used to embed a client-side script of JavaScript code

select

Defines a dropdown list, typically within a form; children are option elements

span

Defines an inline container used to mark up a part of a document, and is finer grained than a div type of container

table

Defines an HTML table structure, with possible children thead and tbody, or directly tr rows

tbody

Defines the body of a table and, if present, is a child of table

td

Defines a cell in a table

th

Defines a header cell in a table

thead

Defines the header portion of a table and, if present, is a child of table

title

Defines the title of the document to be shown in the browser’s title bar or tab

tr

Defines a row in a table

ul

Defines an unordered list structure. Children are li elements

22.1.3 Reading Questions

22.1

Web scraping has received much media attention in recent years. Please share an interesting real-world application of web scraping that you are aware of.

22.2

In the reading, the data sought is located in a table under the h2 header. Will this always be the case? Either justify a “yes” answer or find an example online to demonstrate a “no” answer.

22.3

The reading discusses the evolution of HTML pages over time. It would be a mistake to learn only the newest form of HTML, e.g., because one day you may need to web scrape data on a web page created a long time ago. Indeed, you can even web scrape pages that no longer appear on the Internet, because the "Wayback Machine" stores copies of many web pages. Please explore https://archive.org/web/web.php and think of a real-world example where you would want to access an old copy of a web page, or a web page that was taken down.

22.4

Given that web pages can be changed, please discuss best principles for organizations hosting data that a client may want to scrape, and discuss what a client should do before attempting to scrape a page with old code.

22.5

The first figure in the reading shows that in HTML it is possible for two or more tags to appear on the same line, meaning you have to read an HTML document carefully to discover the tree structure. A nice way to see the HTML text of a web page is to open the page in Chrome, go to "View" -> "Developer" -> "View Source." Please follow this procedure with the web page http://personal.denison.edu/~whiteda/math401spring2016.html

Find the table inside the web page and answer the following questions:
  1. What tag is used to start and end the table?

  2. Does this table have an HTML table header tag?

  3. What XPath expression could you use to find the number of rows of the table, and what is the real-world meaning of this number?

22.6

Similar to the question above, please "View Source" on this page: http://datasystems.denison.edu/data/ and determine the tag name associated with each of the links to data sets listed on the page. How could you determine the number of data sets listed on the page? Is it wise to find and count all elements on the page with the tag you just discovered? Explain.

22.7

The reading mentions “ethical and legal considerations” involved with web scraping. Please think up and describe a situation where, even though a web page is publicly available, it would be unethical to scrape data from the web page.

22.8

The reading mentions “ethical and legal considerations” involved with web scraping. Please describe what you would do in order to determine if it was legal to scrape a web page. Hint: you might consider using a search engine to search for “robots.txt”

22.9

When data is hosted on a publicly available web page, who should be considered the owner of that data? Are you the owner simply because you are looking at the data? This is a difficult question, but please consider it and make an argument in favor of your position. It may help to think about how you would feel if the data consisted of photos of you.

22.10

Review the Terms of Service for Yelp, and additionally search their support pages. Is it legal/permitted to web scrape Yelp pages? Provide specific link references to support your answer.

22.11

Review the Terms of Service for billboard.com and additionally search their support pages. Is it legal/permitted to web scrape, say, the “Hot 100” page? Provide specific link references to support your answer.

22.12

Suppose a web page has information you want to use, but you cannot determine if you are allowed to scrape the web page. Please look up the term “fair use” and describe what you learned and how it might affect you in a situation like this.

22.13

If you blindly web scrape, e.g., using the app SiteSucker to download all files on a given web portal, then you could easily end up possessing files that are illegal to possess (e.g., child pornography). Conversely, it is also possible to view illegal materials without downloading them. Please discuss how you believe the law should cope with these dual situations.

22.2 Web Scraping Examples

We now proceed through a graduated set of examples of web scraping. We start with a simple table, but one that occurs within realistic web development using an application for generating web pages. This is, in fact, the example from Figs. 22.2 and 22.3 above. We then proceed to a table embedded in a publicly available site in Wikipedia. These two examples both use the HTML table structure to represent their data. The exercises will allow the reader to explore web scraping of a nested list structure. Our final example will explore a web page whose content is determined from a simple HTML form, and whose data is acquired through a POST operation.

Each example will describe parts of the discovery process, by which an individual determines the structure used to represent the data on the web page, and then the data extraction, applying procedural and declarative steps using XML/XPath operations (see Chap. 16).

Before we proceed with the examples, we discuss some considerations that apply when we retrieve HTML in a request, which can make the process slightly different from the (mostly similar) techniques of acquiring XML data as explored in Sect. 21.4.

22.2.1 Formulating Requests for HTML

In Sect. 22.1, we described how HTML, as produced by a web developer, might not follow strict tree formatting. Nonetheless, to accomplish web scraping, we need a tree that we can use both for discovery and for programmatic extraction of data. So, instead of using the default ElementTree parser, or specifying a custom XML parser, we define a parser that can take many forms of “broken” HTML and parse it into a properly formed ElementTree.

For instance, consider the string version of a “broken” HTML:
../images/479588_1_En_22_Chapter/479588_1_En_22_Figa_HTML.png
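A plausible reconstruction of that string, consistent with the description below and the parsed result that follows (the exact string appears in the book's figure):

broken_html = "<html><head><title>test<body><h1>header title</h3>"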
This HTML lacks the closing tags corresponding to the html, head, title, and body start tags and closes an h1 with an h3. But using the ElementTree constructor for an HTMLParser and then using that to parse the above string, converted to a file-like object, we get the following tree as a result:3
../images/479588_1_En_22_Chapter/479588_1_En_22_Figb_HTML.png
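The parse might look like the following sketch, assuming lxml's etree module and the book's util.print_xml() helper; StringIO supplies the file-like object:

from io import StringIO
from lxml import etree

# HTMLParser is lenient and repairs broken HTML during the parse
htmlparser = etree.HTMLParser()
tree = etree.parse(StringIO(broken_html), htmlparser)
util.print_xml(tree.getroot())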
| <html>
|   <head>
|     <title>test</title>
|   </head>
|   <body>
|     <h1>header title</h1>
|   </body>
| </html>

All missing end tags have been supplied, and the <h1> was properly closed with a </h1>. The parser uses known structure and common nesting to make a best-effort construction of a tree from the HTML input.

If, in the course of web scraping, there is HTML input that is even more malformed and unable to be parsed in the above manner, the lxml module has a submodule named html with its own, even more robust, parser. In addition, the resultant tree has some additional methods and facilities that give extra power when working with HTML. These aspects are beyond the scope of our more basic goals in this chapter, but we show the construction of a tree using the alternative parser in the html submodule:
../images/479588_1_En_22_Chapter/479588_1_En_22_Figc_HTML.png
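A sketch of this alternative, where fromstring() builds the best-effort tree directly from the string:

from lxml import html

root = html.fromstring(broken_html)
util.print_xml(root)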
| <html>
|   <head>
|     <title>test</title>
|   </head>
|   <body>
|     <h1>header title</h1>
|   </body>
| </html>

There are other terrific packages out there that specifically support web scraping. For instance, we highly recommend the package “Beautiful Soup” [61]. Because of the commonality between what we are doing here and our earlier treatment of XML, we chose, in this book, to continue to use lxml.

In the examples that follow, we will use the requests module to make HTTP requests. The first set of examples will use GET requests, and the last will develop the use of a POST request in order to obtain the HTML that will act as a source for our web scraping. We will again use functions in our custom util module for building URLs and printing results.

22.2.2 Simple Table

Our first example entails an HTML table on the page "/ind2016.html" at datasystems.denison.edu. This page is the same one discussed in Sect. 22.1.1, with a straightforward structure and little that “adorns” the presentation of the table. Our goal is to acquire the page, discover how the data set is represented, and extract the data into a row and column data frame.

We submit a GET request, passing the URL for an HTML resource, and use the raw bytes version of the body of the response to obtain a parse tree for subsequent operations. To display the results, we again use util.print_xml() (see Sect. A.1) and use its parameters to limit the depth of the recursion and the maximum number of children at a particular level. This way we can aid our discovery and examine subtrees as we explore a given HTML.
../images/479588_1_En_22_Chapter/479588_1_En_22_Figd_HTML.png
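A sketch of the request and parse; util.buildURL and the keyword parameters of util.print_xml() are our assumptions about the book's custom helpers:

import io
import requests
from lxml import etree

url = util.buildURL("/ind2016.html", "datasystems.denison.edu")
response = requests.get(url)
assert response.status_code == 200

htmlparser = etree.HTMLParser()
tree = etree.parse(io.BytesIO(response.content), htmlparser)
root1 = tree.getroot()
util.print_xml(root1, depth=3, nchild=4)  # limit recursion depth/children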
| <html xmlns='http://www.w3.org/1999/xhtml' xml:lang='en' la
|   <head>
|     <meta charset='utf-8'></meta>
|     <meta http-equiv='X-UA-Compatible' content='IE=edge'></
|     <title>ind2016 | Introduction to Data Systems</title>
|      ...
|   </head>
|   <body class='sandvox has-page-title allow-sidebar no-cust
|     <div id='page-container'>
|       <div id='page'>
|         <div id='page-top' class='no-logo has-title has-tag
|         </div>
|         <<cyfunction Comment at 0x1293d1350>>page-top</<cyf
|         <div class='clear below-page-top'></div>
|          ...

In this case, as shown in the prefix printout above, and in greater detail in Fig. 22.1, the structure of the HTML has a head and a body, and in the body, we have significant nesting of div nodes.

To be able to process a table for its data content, we must find the HTML table element in question from within the overall tree of HTML. We can start by using XPath and querying for the set of table elements within the body of the document:
../images/479588_1_En_22_Chapter/479588_1_En_22_Fige_HTML.png
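The query itself, using the expression quoted later in the chapter and in reading question 22.18:

nodeset = root1.xpath("/html/body/div//table")
print(len(nodeset))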
| 1
We find that a single table is contained in the body of the HTML, so we assign the first (and only) element in the nodeset and perform an exploratory print of the table:
../images/479588_1_En_22_Chapter/479588_1_En_22_Figf_HTML.png
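A sketch of the assignment and exploratory print:

table = nodeset[0]
util.print_xml(table, depth=3, nchild=3)  # assumed parameter names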
| <table class='table table-bordered table-hover table-conden
|   <thead>
|     <tr>
|       <th title='Field #1'>code</th>
|       <th title='Field #2'>country</th>
|       <th title='Field #3'>pop</th>
|        ...
|     </tr>
|   </thead>
|   <tbody>
|     <tr>
|       <td>CAN</td>
|       <td>Canada</td>
|       <td align='right'>36.26</td>
|        ...
|     </tr>
|     <tr>
|       <td>CHN</td>
|       <td>China</td>
|       <td align='right'>1378.66</td>
|        ...
|     </tr>
|     <tr>
|       <td>IND</td>
|       <td>India</td>
|       <td align='right'>1324.17</td>
|        ...
|     </tr>
|      ...
|   </tbody>
| </table>
We find that the table has children of thead and tbody, that there is a single row in thead that has th elements for each of the column names, and that the data is contained in the tr elements of tbody. We can obtain a vector of column names via XPath:
../images/479588_1_En_22_Chapter/479588_1_En_22_Figg_HTML.png
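Using the expression discussed in reading question 22.19:

column_names = table.xpath("./thead/tr/th/text()")
print(column_names)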
| ['code', 'country', 'pop', 'gdp', 'life', 'cell']
We have two options for acquiring the data itself. Given the simple nature of the table, we see that the field values are uniformly represented in the text of each of the td nodes. So we could construct a single list of the entire set of field values, and we could then use that list to construct a list of row lists (LoL) representation of the data:
../images/479588_1_En_22_Chapter/479588_1_En_22_Figh_HTML.png
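A sketch of this approach; the tdlist expression is the one quoted in reading question 22.20, and fieldcount (see reading question 22.21) slices the flat list into rows:

import json

tdlist = table.xpath("./tbody/tr/td/text()")
fieldcount = len(column_names)
LoL = [tdlist[i:i + fieldcount]
       for i in range(0, len(tdlist), fieldcount)]
print(json.dumps(LoL, indent=2))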
| [
|   [
|     "CAN",
|     "Canada",
|     "36.26",
|     "1535.77",
|     "82.3",
|     "30.75"
|   ],
|   [
|     "CHN",
|     "China",
|     "1378.66",
|     "11199.15",
|     "76.25",
|     "1364.93"
|   ],
|   [
|     "IND",
|     "India",
Alternatively, we could construct a dictionary of column lists (DoL) representation. Here, we can design an XPath expression that yields a list of data values for a particular column. This would use a predicate that involves the 1-relative position of the column (i.e., starting to count at 1 instead of 0). This strategy is then employed in a loop to create each of the dictionary columns based on the set of column names obtained above:
../images/479588_1_En_22_Chapter/479588_1_En_22_Figi_HTML.png
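A sketch of the loop; the format-string XPath follows the snippet quoted in reading question 22.22, adapted to be relative to the table node:

DoL = {}
for index, column in enumerate(column_names):
    xpath = "./tbody/tr/td[{}]/text()".format(index + 1)
    DoL[column] = table.xpath(xpath)
print(json.dumps(DoL, indent=2))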
| {
|   "code": [
|     "CAN",
|     "CHN",
|     "IND",
|     "RUS",
|     "USA",
|     "VNM"
|   ],
|   "country": [
|     "Canada",
|     "China",
|     "India",
|     "Russia",
|     "United States",
|     "Vietnam"
|   ],
|   "pop": [
|     "36.26",
|     "1378.66",
Given either representation, it is straightforward to create a pandas data frame for our data. We illustrate with the DoL representation.
../images/479588_1_En_22_Chapter/479588_1_En_22_Figj_HTML.png
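A sketch, using code as the row index to match the output below:

import pandas as pd

df = pd.DataFrame(DoL)
df = df.set_index('code')
print(df)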
|             country      pop       gdp   life     cell
| code
| CAN          Canada    36.26   1535.77  82.30    30.75
| CHN           China  1378.66  11199.15  76.25  1364.93
| IND           India  1324.17   2263.79  68.56  1127.81
| RUS          Russia   144.34   1283.16  71.59   229.13
| USA   United States   323.13  18624.47  78.69   395.88
| VNM         Vietnam    94.57    205.28  76.25   120.60

If a web page were to have multiple tables contained within the body of the HTML, our discovery step would need to be more involved, possibly printing parts of the structure of the multiple tables returned from root1.xpath("/html/body/div//table") and determining which one contained the desired data.

Many other variations beyond a simple table like this one are possible. We will see one such variation in our example of Sect. 22.2.3. One of the more common ways in which variations occur is when a data cell (in a td element) has additional structure. For formatting, the td might have a span child. Or part of the data in the cell might be incorporated into a link to another web page or to a relative link within the current web page. In this case, the data might be part of the text of the link element.

22.2.3 Wikipedia Table

Consider the following common scenario: a client application developer is seeking data to complement other parts of their application. For instance, they may find they need data on the latest estimated population of each of the states in the United States. While the data may be available from a number of open data sources (census.gov, for instance), the developer finds the data they want on Wikipedia: https://en.m.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population. Figure 22.8 shows a screenshot of a subset of this data from the referenced Wikipedia web page. The subset is in both dimensions: columns of the table as well as rows.
../images/479588_1_En_22_Chapter/479588_1_En_22_Fig8_HTML.png
Fig. 22.8

Wikipedia population table

22.2.3.1 Goal

We see in the rendered page a table of state populations. Population data and ranks are relative to the 2010 census and population estimates for 2019. In our case, we are interested in the most recent data, even if it is an estimate, and so we want to extract the current rank (as an integer), the string of the name of the state (we do not care about the state flag picture), and the estimate of the population as of July 1, 2019. These are the first, third, and fourth columns in the table.

Before engaging in web scraping, a developer must look at the acceptable use policy of a provider and also look at their policy on automatic access to their pages. In this case, such an investigation revealed that programmatic access should go through a defined API and that requests can be limited in number and frequency.4

22.2.3.2 Discovery

Following Wikipedia policy and using their simple API, where we can build a URL for a particular desired page by constructing a resource path relative to /api/rest_v1/page/html/, we obtain and parse the tree for the page giving the population by state data set:
../images/479588_1_En_22_Chapter/479588_1_En_22_Figk_HTML.png
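A sketch of the acquisition; the host and the util.buildURL helper are our assumptions:

page = ("List_of_states_and_territories_of_the_"
        "United_States_by_population")
resource = "/api/rest_v1/page/html/" + page
url = util.buildURL(resource, "en.wikipedia.org")
response = requests.get(url)

htmlparser = etree.HTMLParser()
root = etree.parse(io.BytesIO(response.content), htmlparser).getroot()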
In this case, we suspect there are multiple table HTML elements in the page, so we use XPath to obtain the collection and delve into the tables to determine the correct one:
../images/479588_1_En_22_Chapter/479588_1_En_22_Figl_HTML.png
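A sketch of the collection query and count:

tables = root.xpath("//table")
print(len(tables))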
| 5
If there are multiple tables, we need to discover which table carries the data we desire. We happen to know that a characteristic of data-carrying Wikipedia tables is that they are sortable and carry a class attribute with the value "wikitable sortable".5
../images/479588_1_En_22_Chapter/479588_1_En_22_Figm_HTML.png
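Narrowing the query with a predicate on that class value might look like:

tables = root.xpath("//table[@class='wikitable sortable']")
print(len(tables))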
| 2
We look more closely at the first of these two tables:
../images/479588_1_En_22_Chapter/479588_1_En_22_Fign_HTML.png
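A sketch of the selection and print:

table = tables[0]
util.print_xml(table, depth=4, nchild=3)  # assumed parameters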
| <table class='wikitable sortable' style='width:100%; text-a
|   <tbody>
|     <tr style='vertical-align: top;'>
|       <th colspan='2' style='vertical-align: middle'>Rank</
|       <th rowspan='2' style='vertical-align: middle'>State<
|       <th colspan='2' style='vertical-align: middle'>Census
|        ...
|     </tr>
|     <tr>
|       <th>Current</th>
|       <th>2010</th>
|       <th>Estimate,
|         <br></br>
|         <sup ...>
|         </sup>
|       </th>
|        ...
|     </tr>
|     <tr>
|       <td align='center'>
|         <span ...>1</span>
|       </td>
|       <td align='center'>
|         <span ...>1</span>
|       </td>
|       <td style='text-align: left;'>
|         <span ...>
|         </span>
|         <a href='/wiki/California' title='California'>Calif
|       </td>
|        ...
|     </tr>
|      ...
|   </tbody>
| </table>
We discover that this first table is indeed the table we are looking for. The print of part of the tree above shows us that there is a tbody child of table present, but, in this case, no thead. Looking more closely at the first two tr child nodes of tbody, we see that they are populated with th elements, and these correspond to the two rows of the header of the table seen in Fig. 22.8. The data-carrying rows begin with the third tr child of the tbody. To continue our discovery, we use XPath relative to the table to obtain the third row entry and examine its structure.
../images/479588_1_En_22_Chapter/479588_1_En_22_Figo_HTML.png
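A sketch of obtaining that row relative to the table:

row = table.xpath("./tbody/tr[3]")[0]
util.print_xml(row)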
| <tr>
|   <td align='center'>
|     <span ...>1</span>
|   </td>
|   <td align='center'>
|     <span ...>1</span>
|   </td>
|   <td style='text-align: left;'>
|     <span ...>
|       <noscript>
|       </noscript>
|       <img width='23' height='15' class='thumbborder image-
|     </span>
|     <a href='/wiki/California' title='California'>Californi
|   </td>
|   <td>39,512,223</td>
|   <td>37,253,956</td>
|    ...
| </tr>

The data for this row of the table is contained in the td elements, and this first row corresponds to California, whose first and second columns have value 1, and the third field has a picture and a link whose label is California.

We observe:
  1. The current state rank is in the first td in the row and is the text of a span node under the td element. It is not always the case that the first visible table column corresponds to the first td node in a row; some tables use additional td elements within their rows for spacing, borders, and other rendering effects.

  2. The state rank at the last census in 2010 is in the second td in the row; we will disregard this field based on our goal of wanting the most recent estimate.

  3. The third td contains the state information, and we will explore that further below.

  4. The fourth td contains the estimated population in 2019, and here the value is directly in the text of the td element.


So we will use positioning, the relative td within the row, to get the columns we are interested in.

Focusing on the data cell where we find the state information:
../images/479588_1_En_22_Chapter/479588_1_En_22_Figp_HTML.png
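Continuing the sketch, we select the third td of the row:

state_td = row.xpath("./td[3]")[0]
util.print_xml(state_td)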
| <td style='text-align: left;'>
|   <span ...>
|     <noscript>
|       <img alt='' src='//upload.wikimedia.org/wikipedia/com
|     </noscript>
|     <img width='23' height='15' class='thumbborder image-la
|   </span>
|   <a href='/wiki/California' title='California'>California<
| </td>

We observe that the name of the state is embedded in a hyperlink, given in the node with tag a. This node is beneath the td; the name of the state is in the text property of the a node.

22.2.3.3 Data Extraction

Understanding the tree structure of the table, we can now acquire the data for columns 1, 3, and 4 of the table using XPath. We know that the data-carrying rows begin after row position 2, and because the table has a row for the District of Columbia, we want to extract data from 51 rows. We do not want to go further, as the rows beyond contain information on territories and aggregate data.

Extracting column 1 from rows 3 through 53, and finding the data in the text property of a span underneath the appropriate td, we show the first four entries of the 51 extracted.
../images/479588_1_En_22_Chapter/479588_1_En_22_Figq_HTML.png
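A sketch, using a position predicate like the one quoted in reading question 22.24; the span traversal reflects observation 1 above:

xs = ".//tr[position() > 2 and position() < 54]/td[1]/span/text()"
rank_column = [int(rank) for rank in table.xpath(xs)]
print(rank_column[:4])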
| [1, 2, 3, 4]
Extracting the state strings from column 3 from rows 3 through 53, by traversing into the a node and extracting the text property, we again show the first four entries of the 51.
../images/479588_1_En_22_Chapter/479588_1_En_22_Figr_HTML.png
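This expression is quoted verbatim in reading question 22.24:

xs = ".//tr[position() > 2 and position() < 54]/td[3]//a/text()"
state_column = table.xpath(xs)
print(state_column[:4])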
| ['California', 'Texas', 'Florida', 'New York']
Finally, for the population, we need to do a little work to convert a comma-separated digit string into an integer, so we define a lambda function to perform the conversion and then, after we use XPath to extract a vector of strings, use a list comprehension to apply the conversion:
../images/479588_1_En_22_Chapter/479588_1_En_22_Figs_HTML.png
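A sketch of the conversion and extraction:

toint = lambda s: int(s.replace(",", ""))  # "39,512,223" -> 39512223
xs = ".//tr[position() > 2 and position() < 54]/td[4]/text()"
pop_column = [toint(s) for s in table.xpath(xs)]
print(pop_column[:4])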
| [39512223, 28995881, 21477737, 19453561]
Now that we have the three desired columns, we can define a dictionary of column lists representation and construct the data frame:
../images/479588_1_En_22_Chapter/479588_1_En_22_Figt_HTML.png
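A sketch of the assembly:

DoL = {'rank': rank_column, 'state': state_column,
       'population': pop_column}
df = pd.DataFrame(DoL).set_index('rank')
print(df.head(10))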
|                state  population
| rank
| 1         California    39512223
| 2              Texas    28995881
| 3            Florida    21477737
| 4           New York    19453561
| 5       Pennsylvania    12801989
| 6           Illinois    12671821
| 7               Ohio    11689100
| 8            Georgia    10617423
| 9     North Carolina    10488084
| 10          Michigan     9986857

22.2.4 POST to Submit a Form

22.2.4.1 Goal

Consider the website in Fig. 22.9, providing weekly gasoline prices in California for a particular year. This website is available at URL https://ww2.energy.ca.gov/almanac/transportation_data/gasoline/margins/index_cms.php. URLs whose final element of the resource path ends in .php denote PHP files, which allow a website to create dynamic content. This dynamic content is often the result of a user in a web browser interacting with user interface elements known as forms, where they can enter information, select items from dropdown boxes, and otherwise associate the “answers” to a form with defined fields that are part of the form. When this process is complete, they click an action button that submits the form. In HTTP, a form submission uses a POST, where the body of the request message encodes the set of form field names and their mapping to the user’s selected values. The operation of such a POST was illustrated in Sect. 20.3.2.
../images/479588_1_En_22_Chapter/479588_1_En_22_Fig9_HTML.png
Fig. 22.9

CA gas prices web page

By using this mechanism, a web scraping client can design and construct a request and then receive the resultant HTML. With the result, we can then process the constituent table, ul, or ol data and extract it into a usable form. This section will demonstrate this technique for the CA Gas Prices web page.

From a displayed web page perspective, at the bottom of the same page, Figs. 22.10 and 22.11 show the simple form interface provided on this web page. We see both the dropdown, from which the user can select a particular year, and the selected value and button used to submit the form.
../images/479588_1_En_22_Chapter/479588_1_En_22_Fig10_HTML.png
Fig. 22.10

CA gas prices year selection

../images/479588_1_En_22_Chapter/479588_1_En_22_Fig11_HTML.png
Fig. 22.11

CA gas prices form

22.2.4.2 Discovery

Given a web page with a form like the one in our example, our discovery must determine the specifics of the form and how to construct a client-based POST request to which the target web server will respond, allowing the client to acquire the HTML-based data.

In this particular example, through examination of the HTML, we can discover the specification of the form on the web page:

<form action='index_cms.php' method='post'
      style='margin-left:10px;'>
  <label for='year'>
    <select name='year' id='year'>
        <option value='2020'>Select Year</option>
        <option value='2020'>2020</option>
        <option value='2019'>2019</option>
        <option value='2018'>2018</option>
        <option value='2017'>2017</option>
        ...
        <option value='1999'>1999</option>
    </select>
  </label>
  <input name='newYear' type='submit'
         value='Get different year' />

The <select> tag defines the dropdown, with <option> entries for each of the items displayed. The name and id attributes both have value year; this will be the field name used for one of the entries in the form, and the user-selected value is determined by the value attribute of the selection.

The <input> tag defines the submission button, where the newYear will be the second field name of the form, and this entry will have the value given by the value attribute, so the form entry newYear will be associated with Get different year.

The <form> tag defines the overall operation of the form, using action and method to indicate that, on a submission, the result should be a POST request to the resource path given by action, which here is index_cms.php. In this case, this means that the POST of the form is back to the same web page as the original GET.

A discovery process often couples examination of the HTML with use of a web browser and its associated developer tools, looking at network message interaction, to interact with the form and experimentally determine what happens when a user selects a year and clicks the Get different year button.

Summarizing our discovery:
  1. The page itself, in HTML, has our desired data by week, and each week is a separate table object, with a tr (row) for each of the variables and associated values for that week.

  2. The form consists of two entries:

    • the year field maps to the four-digit string of the desired year, and

    • the newYear field maps to the constant value Get different year.

  3. The HTTP method is POST, which means that the form entries should be in the body of the POST as a URL-encoded field=value for each field, separated, per URL encoding, with an ampersand (&).

  4. The resource path/URI of the POST is the same resource path as the original.


22.2.4.3 Request and Data Extraction

In contrast to most earlier examples, we need to change two things in using the requests module to make this request:
  1. We must make a POST request instead of a GET request.

  2. The request must include a body that consists of key-value pairs.

For (1), the requests module has a post top-level function. For (2), we construct a dictionary with the desired mappings and pass it to post() using the named parameter data. The requests module is very flexible in how it interprets an argument provided through data. If it is a string, it simply puts the encoded bytes of the string in the body. If it is a dictionary, as in this case, it interprets it and generates a URL-encoded version, as we will see below. Suppose we want to get CA gas price data for the year 2001:
../images/479588_1_En_22_Chapter/479588_1_En_22_Figu_HTML.png
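A sketch of the POST; payload carries the two form fields discovered above, and util.buildURL is again our assumed helper:

url = util.buildURL("/almanac/transportation_data/gasoline/"
                    "margins/index_cms.php", "ww2.energy.ca.gov")
payload = {'year': '2001', 'newYear': 'Get different year'}
response = requests.post(url, data=payload)
assert response.status_code == 200
print("POST body:", response.request.body)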
| POST body: year=2001&newYear=Get+different+year
The print() helps show the result of using the payload dictionary and specifying it as the body of the request through the data= named parameter of post(). The requests module translated the dictionary into a URL-encoded set of field=value entries separated by &, with embedded spaces translated into the + character.
../images/479588_1_En_22_Chapter/479588_1_En_22_Figv_HTML.png
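One way to inspect the headers of the prepared request that requests actually sent:

print(json.dumps(dict(response.request.headers), indent=2))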
| {
|   "User-Agent": "python-requests/2.22.0",
|   "Accept-Encoding": "gzip, deflate",
|   "Accept": "∗/∗",
|   "Connection": "keep-alive",
|   "Content-Length": "36",
|   "Content-Type": "application/x-www-form-urlencoded"
| }

Note that the specification in the request headers, through Content-Type, indicates that the body of the POST is a URL-encoded form.

Since our request was successful, we can take the bytes of the response and parse the returned HTML into a tree:
../images/479588_1_En_22_Chapter/479588_1_En_22_Figw_HTML.png
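A sketch of the parse:

htmlparser = etree.HTMLParser()
root = etree.parse(io.BytesIO(response.content), htmlparser).getroot()
print(root.tag)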
| html
The tables that we desire all use the table HTML structure and are positioned as direct children of div elements, where the div has a class attribute of "contnr".
../images/479588_1_En_22_Chapter/479588_1_En_22_Figx_HTML.png
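The query uses the expression discussed in reading question 22.30:

tables = root.xpath("//div[@class='contnr']/table")
print(len(tables))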
| 53
../images/479588_1_En_22_Chapter/479588_1_En_22_Figy_HTML.png
| <table>
|   <caption>
|     <h2>Dec 31</h2>
|   </caption>
|   <tr>
|     <td></td>
|     <th scope='col'>Branded</th>
|     <th scope='col'>Unbranded</th>
|   </tr>
|   <tr>
|     <th scope='row' class='tWidth'>Distribution Costs, Mark
|     <td class='numbers'>
|       <span ...>-$0.07</span>
|     </td>
|     <td class='numbers'>
|       <span ...>-$0.04</span>
|     </td>
|   </tr>
|   <tr>
|     <th scope='row'>Crude Oil Costs</th>
|     <td class='numbers'>$0.41</td>
|     <td class='numbers'>$0.41</td>
|   </tr>
|    ...
| </table>

Given that each table represents a single week and that the rows within it represent variables, each HTML table will yield a single row of the data frame representing the data of the page. With an eye toward collecting a list of dictionaries for construction of that data frame, we will develop the processing of one table into one (row) dictionary.

We can see from the print of the tree that the first piece of data needed, the date, is in a caption child of the table. Let us postulate data columns:

['distrib_cost', 'crude_cost', 'refine_cost', 'storage',
 'state_local_tax', 'state_excise_tax', 'fed_excise_tax',
 'retail_price']
Assume we just want the branded data.
../images/479588_1_En_22_Chapter/479588_1_En_22_Figz_HTML.png
../images/479588_1_En_22_Chapter/479588_1_En_22_Figaa_HTML.png
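A sketch of processing one table; the row XPath is the branded-data expression quoted in reading question 22.26, and the dollar-string conversion is our own:

columns = ['distrib_cost', 'crude_cost', 'refine_cost', 'storage',
           'state_local_tax', 'state_excise_tax', 'fed_excise_tax',
           'retail_price']

table = tables[0]
date = table.xpath("./caption/h2/text()")[0]

tofloat = lambda s: float(s.replace("$", ""))  # "$0.41" -> 0.41
values = [tofloat(s)
          for s in table.xpath("./tr[position() > 1]/td[1]/text()")]
print(values)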
| [0.41, 0.31, 0.0, 0.08, 0.18, 0.18, 1.1]
../images/479588_1_En_22_Chapter/479588_1_En_22_Figab_HTML.png
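Zipping the postulated column names with the extracted values, and adding the date, yields the row dictionary:

row = dict(zip(columns, values))
row['date'] = date
print(json.dumps(row, indent=2))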
| {
|   "distrib_cost": 0.41,
|   "crude_cost": 0.31,
|   "refine_cost": 0.0,
|   "storage": 0.08,
|   "state_local_tax": 0.18,
|   "state_excise_tax": 0.18,
|   "fed_excise_tax": 1.1,
|   "date": "Dec 31"
| }
In the interest of good functional abstraction, and because we need to repeat the processing for each of the weekly tables in the web page, we define a function that performs the work for a single table and returns a dictionary representing a single row of the desired table, with column fields mapping to values obtained as shown above.
../images/479588_1_En_22_Chapter/479588_1_En_22_Figac_HTML.png
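A sketch of such a function (the name is our own), consolidating the steps developed above:

def table_to_row(table, columns):
    """Process one weekly table into a dictionary for one data row."""
    date = table.xpath("./caption/h2/text()")[0]
    tofloat = lambda s: float(s.replace("$", ""))
    values = [tofloat(s)
              for s in table.xpath("./tr[position() > 1]/td[1]/text()")]
    row = dict(zip(columns, values))
    row['date'] = date
    return row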
Now it is just a matter of applying our function to each of the tables. Since we desire a list result, using a list comprehension makes this easy.
../images/479588_1_En_22_Chapter/479588_1_En_22_Figad_HTML.png
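Applying the function over all of the weekly tables:

rowlist = [table_to_row(t, columns) for t in tables]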
Finally, we construct the data frame and set its index.
../images/479588_1_En_22_Chapter/479588_1_En_22_Figae_HTML.png
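A sketch of the final construction, printing the first rows and columns:

df = pd.DataFrame(rowlist).set_index('date')
print(df.iloc[:6, :4])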
|         distrib_cost  crude_cost  refine_cost  storage
| date
| Dec 31          0.41        0.31         0.00     0.08
| Dec 24          0.00        0.44         0.23     0.00
| Dec 17          0.08        0.39         0.22     0.00
| Dec 10          0.14        0.38         0.23     0.00
| Dec 03          0.14        0.42         0.23     0.00
| Nov 26          0.17        0.39         0.27     0.00

22.2.5 Reading Questions

22.14

Please think up a reason why there might be a lot of broken HTML on the web (i.e., malformed from an XML perspective).

22.15

Please justify the value of the print_xml() function to a working data scientist. What would life be like if we did not have this function?

22.16

The reading says “Before engaging in web scraping, a developer must look at the acceptable use policy of a provider and also look at their policy on automatic access to their pages.” Please find an example of an “acceptable use policy” and “policy on automatic access” and give your example here, including the link to it. In general, how can you find this kind of information?

22.17

Please study the California gasoline web page used in the POST example and explain what is meant by “Branded” versus “Unbranded” in real-world terms: https://ww2.energy.ca.gov/almanac/transportation_data/gasoline/margins/index_cms.php

22.18

In the first XPath code shown in the reading, explain why we use

nodeset = root1.xpath("/html/body/div//table")

instead of

nodeset = root1.xpath("/html/body/div/table")
22.19

With reference to the content printed by print_xml, please explain carefully why the XPath expression ./thead/tr/th/text() yields a list of column names.

22.20

In the ind2016 example, consider the line

tdlist = table.xpath("./tbody/tr/td/text()")

Please describe tdlist carefully after this line is executed. For instance, how big is this list? What is the type of the items in tdlist? How does it relate to the table?

22.21

In the code for the LoL solution of extracting ind2016 data, what is the purpose of fieldcount?

22.22

In the DoL approach, what is the point of the format string in the following snippet of code?

for index, column in enumerate(column_names):
    xpath = ".//table//tr/td[{}]/text()".format(index+1)
22.24

The reading invoked state_column = table.xpath(xs) for the XPath expression below

xs = ".//tr[position() > 2 and position() < 54]/td[3]//a/text()"

Can you think of a way to rewrite this to avoid the use of the position function? You might consider consulting the reading to make sure you understand what this expression is trying to accomplish.

22.25

The reading shows two ways of building a pandas data frame from a web page: using either a LoL or DoL. Please write a sentence describing how each approach works, then answer: which do you find more intuitive and why?

22.26

In what way, specifically, does the XPath expression in the reading "./tr[position() > 1]/td[1]/text()" extract the branded rather than unbranded data?

22.27

Consider the XPath expression used to extract the list of states. Suppose you wanted to extract the list of links (one per state), e.g., for use in a web crawling program. What XPath expression would you use?

22.28

Instead of a list comprehension applying the lambda function to create pop_column, could we use map? Explain.

22.29

When describing the POST, the reading discusses the resource path given by action. Can you think of an example where a POST would want to send data to a different resource path than the page the user is currently on?

22.30

Explain the use of the XPath expression

"//div[@class='contnr']/table"

by referencing the HTML structure of the web page in question.

22.2.6 Exercises

22.31

Write a function

getHTMLroot(resource, location, protocol="https", params={})

that performs an HTTP GET for the specified resource at location, using protocol, which defaults to https. If params is specified, it should be used as a dictionary of query parameters for the request. If the GET is successful, the function should verify that the result is HTML. If that is true, parse the HTML and return the root of the tree. If the resource is not found, the result is not HTML, or the HTML could not be parsed, return None.

22.32

Write a function

getLinks(resource, location, protocol="https")

that retrieves the HTML tree from resource at host location, using the specified protocol, and then finds all the external hyperlinks referenced in the document. Return a list of strings for the URLs. Recall that a hyperlink is referenced with the tag a, and the href attribute within that tag contains the link itself, as opposed to the displayed text. Since many links in a document can be internal, we want to restrict to those that contain a URL, not a URI. Bonus points if the algorithm returns only the unique external links.

22.33
Consider the following web page: https://datasystems.denison.edu/databases/index.html. Using a combination of XPath and util.print_xml() of the tree or subtrees, answer the following questions:
  • What is the first ordered or unordered list found in the tree? If you wanted that specific list, and wanted to protect against another list being added as the first one, what steps would you take to get the desired list?

  • What would the XPath be to get the div whose class attribute is "RichTextElement" that is a descendant of the div whose id attribute is "main-content"?

  • Given the (single) node that results from answering the previous question, write a single XPath expression that would get the set of strings naming the available databases.

  • Repeat the above, but use procedural steps to accumulate the list of databases.

  • In English, describe what you learned about the HTML structures involved in this collection of information. Is this the same as one of the structural options considered in the chapter?

22.34

Consider the page: https://datasystems.denison.edu/ind2016_list.html. On that page is list-structured data that consists, at the outer level, of an ordered list, and at the inner level, an unordered list of indicator values. It also has some adornment in font face, like bolding and italics. Using the techniques of this chapter, write the code to scrape this page and create a pandas data frame containing rows for each of the five countries, and columns for code, name, gdp, pop, and life.

22.35

Consider the page: http://datasystems.denison.edu/ind0.html, the page described in the book containing triple-nested lists. Note also that all the lists are unordered. Write code to web scrape and create a pandas data frame with three rows, one each for FRA, GBR, and USA, with a code column and four data columns, for each of pop and gdp for years 2007 and 2017. The best solution would use a two-level column index.

The last two exercises ask the reader to scrape tables on Wikipedia. We would ask that you adhere to the acceptable use policy and use the Wikipedia API to access the pages. We also caution that websites not under our control are constantly changing, and while, at the time of this writing, these exercises were feasible, changes could make adaptation necessary.

22.36

At the Wikipedia page called

"List_of_novels_considered_the_greatest"

is a table of novels considered great by one or more experts. Using the resource path prefixed with "/api/rest_v1/page/html/", obtain the HTML tree from that page and scrape the page, building a pandas data frame of the results. You are to design the table as you see fit for the columns to be included. Make sure your solution is robust to the case of additional novels being added to this list.

22.37

At the Wikipedia page called

"List_of_Cash_Box_Top_100_number-one_singles_of_1960"

is a table containing, by week, the top songs from the year 1960. Using the resource path prefixed with "/api/rest_v1/page/html/", obtain the HTML tree from that page and scrape the page, building a pandas data frame of the results. You are to design the table as you see fit for the columns to be included. This exercise is slightly more difficult than the previous one, because the table has entries that “span” more than one row, when the same song is top for multiple weeks.