- Understand HTML as a tree and as an instance of a markup language, analogous to XML, but with structure, tags, and attributes intended for the rendering of web pages.
- Understand the particular data set structuring facilities in HTML:
  - the table HTML structure,
  - ordered and unordered lists (ol and ul) as HTML structures for organizing and displaying data.
- Acquire and parse HTML documents from web servers, using:
  - GET for a web page,
  - POST for custom access to a web page based on a form.
- Use combinations of XPath and procedural traversal to construct usable data sets from the most common HTML structures used for rendering data:
  - HTML tables,
  - HTML lists.
The HyperText Markup Language (HTML) is the format used for web pages on the Internet. The format provides the logical structure of a web page: the organization of information for navigation, formatting, and the incorporation of pictures and other media, as well as linking to other web pages and resources, both local and remote relative to the web server providing the content. Web pages can display data in various forms, and while the intent of the web page may have been the presentation of data, it can also serve as a data source in the context of our Data Systems.
The term web scraping refers to the programmatic access of the HTML of web pages and the extraction of useful information contained therein, even if the intent was the presentation of information and not the providing of data. We should note, at the outset, that data incorporated into web pages is the intellectual property of some owner of the data, often affiliated with the web server provider. As such, there are ethical and legal considerations whenever we decide to acquire data in this manner. The intellectual owner of data has the authority and right to determine the valid use of the data, or its acquisition through these programmatic means. Just because we can acquire data through web scraping techniques does not mean that doing so is ethical or legal, based on the wishes of the owner of that data.
22.1 HTML Structure and Its Representation of Data Sets
Like XML, HTML can define a tree representing the information for a web page. Unlike XML, HTML is a specific grammar. This means that the tags and many of the allowed relationships between parents and children are well defined and have a semantic meaning associated with them.1 For instance, all HTML documents have a root node of html, and its children can only be head and body. Under head, one of the most common children is title, used in rendering to show the title of the web page, often in a browser tab. Often the children of head include nodes for metadata and scripts, which are code segments to be run within the context of the web browser when the page is loaded. The children of head can also define link elements to tie the page to specific formatting styles.
The body of an HTML document subdivides a web page using div nodes, which are structuring elements. These often have class and id attributes to distinguish their use. Programs that can generate web pages for developers use the structuring mechanism of div nodes to provide, within the child structure of the div, elements for navigation and separation of the objects developers can place into a web page.
The body also often has nodes that represent various levels of headers (in nodes tagged h1 through h6), and these are used to give, in the rendered page, sections, subsections, etc., for the document. We can use the p node to define paragraphs within the page, and define figures and captions. Nodes tagged span, often carrying class attributes, are used for grouping underlying page elements and for imposing consistent styling in the rendering of elements.
The text below shows a minimal, but complete, HTML document that is fully capable of being rendered by a web browser. It includes a number of the elements described above.
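A hedged sketch of such a minimal document follows; the chapter's actual listing may differ in its details, but it would exercise the same elements:

```html
<html>
  <head>
    <title>A Minimal Page</title>
  </head>
  <body>
    <h1>Top-Level Heading</h1>
    <div id="main" class="content">
      <p>A paragraph of text, with a
        <a href="https://datasystems.denison.edu">hyperlink</a>.
      </p>
      <span class="note">An inline span element.</span>
    </div>
  </body>
</html>
```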
In its original definition, HTML was not as strict as XML and the associated set of rules governing well-formedness that we saw in Chap. 17. For instance, a paragraph element with its start tag, <p>, was not required to have an end tag, </p>. There were many such instances that, taken together, meant that a single unambiguous tree could not always be parsed and constructed from an HTML source. Specific web browsers, like Chrome and Internet Explorer, would define default heuristics for processing and rendering such cases, and web developers adapted to the rendering choices of the target browsers. As HTML has evolved, newer versions, such as XHTML, have addressed these potential problems by defining a strict version of HTML that adheres to XML well-formedness and is, in fact, defined as an XML Application [73].
The reason for pointing out the failure of some HTML to follow a strict tree definition (in part, by omitting some end tags) is a practical one. In performing web scraping, we must take HTML that has been developed by external entities, and is hence out of our control and perhaps not intended for client application processing, and be able to parse it into a tree structure as we did with XML. Depending on how “broken” the HTML may be, we have some programmatic solutions to build a best-effort tree from the HTML so that we can proceed with our web scraping task.
The process of web scraping involves two phases. The first is discovery, wherein we use tools to acquire the HTML, ascertain its structure, and, in particular, determine how the desired data within the page is represented in the HTML. The second is programmatic extraction, wherein we use our techniques from XML, including both XPath and procedural programming, to process the HTML tree and extract the data, transforming it into a usable result, such as one or more data frames.
The remaining subsections describe the two most common ways of representing data sets in an HTML document. Understanding some of the possible HTML representations of data on a web page allows us to proceed through our discovery phase of web scraping. The HTML elements representing a data set will be subordinate in the tree, at a specific location determined by the overall structure. For example, in Fig. 22.1, these data-set-containing HTML elements would be present in the subtree rooted at the collapsed div at line 38.
22.1.1 HTML Tables
Fortunately, whether a table is simple and unadorned, as illustrated above, or uses borders, shading, and other more complex formatting, the same table HTML structure is used, with the difference coming from style sheets,2 class attributes, and possible subordinate structure in the definition of table header (th) and table data (td) elements.
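To make the structure concrete, a hedged sketch of a simple table follows; the column names and values are hypothetical, in the spirit of the chapter's country indicator examples:

```html
<table>
  <thead>
    <tr><th>code</th><th>pop</th><th>gdp</th></tr>
  </thead>
  <tbody>
    <tr><td>FRA</td><td>66.9</td><td>2586.3</td></tr>
    <tr><td>GBR</td><td>66.1</td><td>2637.9</td></tr>
    <tr><td>USA</td><td>325.1</td><td>19485.4</td></tr>
  </tbody>
</table>
```

Each tr element defines a row, with th header cells naming the columns and td data cells carrying the values.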
22.1.2 HTML Lists
In this particular example, the children of the li tags are a set of span elements. In practice, one often sees div elements as children of li elements as well. One could also imagine, for each li defining a row, a nested list structure with another ordered list (ol) or perhaps an unordered list (ul), e.g., if items in a recipe list had ordered sub-items labeled a, b, c, etc.
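A hedged sketch of such a list structure, with span children under each li as described (class names and values are illustrative), might be:

```html
<ol>
  <li>
    <span class="code">FRA</span>
    <span class="pop">66.9</span>
    <span class="gdp">2586.3</span>
  </li>
  <li>
    <span class="code">USA</span>
    <span class="pop">325.1</span>
    <span class="gdp">19485.4</span>
  </li>
</ol>
```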
Common HTML tags

Tag | Description
---|---
a | Defines a hyperlink, used to link from one page to another, or to a reference point within the current page. The href attribute defines the link itself, while the text gives the displayed version
body | Defines the body contents of an HTML document and is a child of the html root
div | Defines a division or container within an HTML document to allow structure, and may be nested to group children as needed. The id attribute is used to identify elements, and the class attribute is used for categorizing containers and for common styling
form | Defines an HTML form for user input. Some possible children include label, select, and input
h1 to h6 | Defines various levels of HTML headings
head | Child of the html root, used to define the metadata for an HTML document before the definition of the body
html | Defines the root of an HTML tree
input | Defines an input field as part of a form, where a user can enter data
label | Defines a label within a form for, among others, input and select elements
li | Defines a list item, and is a child of ordered lists, ol, and unordered lists, ul
link | Relates the current document to an external resource, often for style information
meta | Defines specific metadata in the head of an HTML document
ol | Defines an ordered list structure. Children are li elements
option | Defines children of a select as dropdown items in the select list
p | Defines a paragraph
script | Used to embed a client-side script of JavaScript code
select | Defines a dropdown list within a form. Children are option elements
span | Defines an inline container used to mark up a part of a document, and is finer grained than a div type of container
table | Defines an HTML table structure, with possible children thead and tbody, or tr rows directly
tbody | Defines the body of a table and, if present, is a child of table
td | Defines a cell in a table
th | Defines a header cell in a table
thead | Defines the header portion of a table and, if present, is a child of table
title | Defines the title of the document, shown in the browser's title bar or tab
tr | Defines a row in a table
ul | Defines an unordered list structure. Children are li elements
22.1.3 Reading Questions
Web scraping has received much media attention in recent years. Please share an interesting real-world application of web scraping that you are aware of.
In the reading, the data sought is located in a table under the h2 header. Will this always be the case? Either justify a “yes” answer or find an example online to demonstrate a “no” answer.
The reading discusses the evolution of HTML pages over time. It would be a mistake to only learn the newest form of HTML, e.g., because one day you may need to web scrape data on a web page created a long time ago. Indeed, you can even web scrape pages that no longer appear on the Internet, because the Wayback Machine stores copies of many web pages. Please explore https://archive.org/web/web.php and think of a real-world example where you would want to access an old copy of a web page, or a web page that was taken down.
Given that web pages can be changed, please discuss best practices for organizations hosting data that a client may want to scrape, and discuss what a client should do before attempting to scrape a page with old code.
The first figure in the reading shows that in HTML it is possible for two or more tags to appear on the same line, meaning you have to read an HTML document carefully to discover the tree structure. A nice way to see the HTML text of a web page is to open the page in Chrome, go to “View” -> “Developer” -> “View Source.” Please follow this procedure with the web page http://personal.denison.edu/~whiteda/math401spring2016.html
1. What tag is used to start and end the table?
2. Does this table have an HTML table header tag?
3. What XPath expression could you use to find the number of rows of the table, and what is the real-world meaning of this number?
Similar to the question above, please “View Source” on this page: http://datasystems.denison.edu/data/ and determine the tag name associated with each of the links to data sets listed on the page. How could you determine the number of data sets listed on the page? Is it wise to find and count all elements on the page with the tag you just discovered? Explain.
The reading mentions “ethical and legal considerations” involved with web scraping. Please think up and describe a situation where, even though a web page is publicly available, it would be unethical to scrape data from the web page.
The reading mentions “ethical and legal considerations” involved with web scraping. Please describe what you would do in order to determine if it was legal to scrape a web page. Hint: you might consider using a search engine to search for “robots.txt”
When data is hosted on a publicly available web page, who should be considered the owner of that data? Are you the owner simply because you are looking at the data? This is a difficult question, but please consider it and make an argument in favor of your position. It may help to think about how you would feel if the data consisted of photos of you.
Review the Terms of Service for Yelp, and additionally search their support pages. Is it legal/permitted to web scrape Yelp pages? Provide specific link references to support your answer.
Review the Terms of Service for billboard.com and additionally search their support pages. Is it legal/permitted to web scrape, say, the “Hot 100” page? Provide specific link references to support your answer.
Suppose a web page has information you want to use, but you cannot determine if you are allowed to scrape the web page. Please look up the term “fair use” and describe what you learned and how it might affect you in a situation like this.
If you blindly web scrape, e.g., using the app SiteSucker to download all files on a given web portal, then you could easily end up possessing files that are illegal to possess (e.g., child pornography). Conversely, it is also possible to view illegal materials without downloading them. Please discuss how you believe the law should cope with these dual situations.
22.2 Web Scraping Examples
We now proceed through a graduated set of examples of web scraping. We start with a simple table, but one that occurs within a realistic web page built with an application for generating web pages. This is, in fact, the example from Figs. 22.2 and 22.3 above. We then proceed to a table embedded in a publicly available site in Wikipedia. These two examples both use the HTML table structure to represent their data. The exercises will allow the reader to explore web scraping of a nested list structure. Our final example will explore a web page whose content is determined from a simple HTML form, and whose data is acquired through a POST operation.
Each example will describe parts of the discovery process, by which an individual determines the structure used to represent the data on the web page, and the data extraction through the application of the procedural and declarative steps using XML/XPath operations (see Chap. 16).
Before we proceed with the examples, we discuss some considerations that apply when we retrieve HTML in a request, and that make the process slightly different from the (mostly similar) techniques of acquiring XML data explored in Sect. 21.4.
22.2.1 Formulating Requests for HTML
In Sect. 22.1, we described how HTML, as produced by a web developer, might not follow strict tree formatting. Nonetheless, to accomplish web scraping, we need a tree that we can use both for discovery and for programmatic extraction of data. So, instead of using the default ElementTree parser, or specifying a custom XML parser, we define a parser that can take many forms of “broken” HTML and parse them into a properly formed ElementTree.
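A minimal sketch of this idea uses lxml's HTMLParser; the fragment of broken HTML here is contrived for illustration:

```python
from lxml import etree

# A contrived fragment: unclosed p tags, an unclosed h1, no end tags at all
broken_html = "<html><body><p>First paragraph<p>Second paragraph<h1>A heading"

# lxml's HTMLParser makes a best-effort recovery of a well-formed tree
htmlparser = etree.HTMLParser()
root = etree.fromstring(broken_html, parser=htmlparser)

# Serialize the recovered tree to see the repaired structure
print(etree.tostring(root, pretty_print=True).decode())
```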
All unclosed tags have been closed, and the <h1> was closed with a </h1>. The parser uses known structure and common nesting to make a best-effort construction of a tree from the HTML input.
There are other terrific packages out there that specifically support web scraping. For instance, we highly recommend the package “Beautiful Soup” [61]. Because of the commonality between what we are doing here and our earlier treatment of XML, we chose, in this book, to continue to use lxml.
In the examples that follow, we will use the requests module to make HTTP requests. The first set of examples will use GET requests, and the last will develop the use of a POST request in order to obtain the HTML that will act as a source for our web scraping. We will again use functions in our custom util module for building URLs and printing results.
22.2.2 Simple Table
Our first example entails an HTML table on the page "/ind2016.html" at datasystems.denison.edu. This page is the same one discussed in Sect. 22.1.1, with a straightforward structure and little that “adorns” the presentation of the table. Our goal is to acquire the page, discover how the data set is represented, and extract the data into a row and column data frame.
In this case, as shown in the prefix printout above, and in greater detail in Fig. 22.1, the structure of the HTML has a head and a body, and in the body, we have significant nesting of div nodes.
If a web page were to have multiple tables contained within the body of the HTML, our discovery step would need to be more involved, possibly printing parts of the structure of the multiple tables returned from nodeset = root1.xpath("/html/body/div//table") and determining which one contained the desired data.
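For the single-table page at hand, a hedged sketch of the full acquisition and extraction might look like the following; the XPath details assume the thead/tbody structure discovered above:

```python
import requests
import pandas as pd
from lxml import etree

# Acquire the page and parse it with the forgiving HTML parser
url = "https://datasystems.denison.edu/ind2016.html"
resp = requests.get(url)
root1 = etree.fromstring(resp.text, parser=etree.HTMLParser())

# Discovery found a single table nested under div elements in the body
table = root1.xpath("/html/body/div//table")[0]

# Column names come from the th elements of the table header
columns = table.xpath("./thead/tr/th/text()")

# Each tr under tbody (assumed present) yields one row of td cell text,
# accumulated as a list of lists (LoL)
LoL = [[td.text for td in tr.xpath("./td")]
       for tr in table.xpath("./tbody/tr")]

df = pd.DataFrame(LoL, columns=columns)
print(df)
```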
Many other variations beyond a simple table like this one are possible. We will see one such variation in our example of Sect. 22.2.3. One of the more common ways in which variations occur is when a data cell (in a td element) has additional structure. For formatting, the td might have a span child. Or part of the data in the cell might be incorporated into a link to another web page or to a relative link within the current web page. In this case, the data might be part of the text of the link element.
22.2.3 Wikipedia Table
22.2.3.1 Goal
We see in the rendered page a table of state populations. Population data and ranks are relative to the 2010 census and population estimates for 2019. In our case, we are interested in the most recent data, even if it is an estimate, and so we want to extract the current rank (as an integer), the string of the name of the state (we do not care about the state flag picture), and the estimate of the population as of July 1, 2019. These are the first, third, and fourth columns in the table.
Before engaging in web scraping, a developer must look at the acceptable use policy of a provider and also look at their policy on automatic access to their pages. In this case, such an investigation revealed that programmatic access should go through a defined API and that requests can be limited in number and frequency.4
22.2.3.2 Discovery
The data for this row of the table is contained in the td elements; this first row corresponds to California, whose first and second columns have value 1, and whose third field has a picture and a link whose label is California.
1. The current state rank is in the first td in the row and is the text of a span node under the td element. It is not always the case that the first visible table entry corresponds to the first td node in a row; some tables use additional td elements in the set of rows for spacing, borders, and other rendering results.
2. The state rank at the last census in 2010 is in the second td in the row; we will disregard this field based on our goal of wanting the most recent estimate.
3. The third td contains the state information, and we will explore that further below.
4. The fourth td contains the estimated population in 2019, and here the value is directly in the text of the td element.
So we will use positioning to get the columns (as the relative td within the row) that we are interested in.
We observe that the name of the state is embedded in a hyperlink, given in the node with tag: a. This node is beneath the td; the name of the state is in the text property of the a node.
22.2.3.3 Data Extraction
Understanding the tree structure of the table, we can now acquire the data for columns 1, 3, and 4 of the table using XPath. We know that the data-carrying rows begin after row position 2, and because the table has a row for the District of Columbia, we want to extract data from 51 rows. We do not want to go further, as these rows contain information on territories and have aggregate data.
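A hedged sketch of this extraction follows, assuming table refers to the population table Element found during discovery; the exact row predicate depends on the version of the page at the time of writing:

```python
import pandas as pd

# Row predicate: skip the two header rows and stop after the 51 data rows
# (50 states plus the District of Columbia)
rows = "position() > 2 and position() < 54"

# Rank is the text of a span under the first td; the state name is the
# text of the a (hyperlink) node under the third td; the 2019 population
# estimate is directly the text of the fourth td
rank_column  = table.xpath(".//tr[{}]/td[1]/span/text()".format(rows))
state_column = table.xpath(".//tr[{}]/td[3]//a/text()".format(rows))
pop_column   = table.xpath(".//tr[{}]/td[4]/text()".format(rows))

# Assemble a data frame (DoL approach), converting strings to integers
df = pd.DataFrame({
    "rank":       [int(s) for s in rank_column],
    "state":      state_column,
    "population": [int(s.strip().replace(",", "")) for s in pop_column],
})
```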
22.2.4 POST to Submit a Form
22.2.4.1 Goal
By using this mechanism, a web scraping client can design and construct a request and then receive the resultant HTML. With the result, we can then process the constituent table, ul, or ol data and extract it into a usable form. This section will demonstrate this technique for the CA Gas Prices web page.
22.2.4.2 Discovery
Given a web page with a form like in our example, our discovery must determine the specifics of the form and how to construct a client-based POST request to which the target web server will respond, so that the client can acquire the HTML-based data.
In this particular example, through examination of the HTML, we can discover the specification of the form on the web page:
The <select> tag defines the dropdown, with <option> entries for each of the items displayed. The name and id attributes both have the value year; this will be the field name used for one of the entries in the form, and the user-selected value is determined by the value attribute of the selection.
The <input> tag defines the submission button, where newYear will be the second field name of the form, and this entry will have the value given by the value attribute; so the form entry newYear will be associated with Get different year.
The <form> tag defines the overall operation of the form, using action and method to indicate that, on a submission, the result should be a POST request to the resource path given by action, which here is index_cms.php. In this case, this means that the POST of the form is back to the same web page as the original GET.
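Assembled from these observations, a hypothetical reconstruction of the form's HTML might look like the following; the actual page may include additional attributes and many more options:

```html
<form action="index_cms.php" method="post">
  <label for="year">Select a Year:</label>
  <select name="year" id="year">
    <option value="2021">2021</option>
    <option value="2020">2020</option>
    <!-- ... one option per available year ... -->
  </select>
  <input type="submit" name="newYear" value="Get different year"/>
</form>
```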
A discovery process often uses examination of the HTML and couples that with using a web browser and associated developer tools, looking at network message interaction, to interact with the form and experimentally determine what happens when a user selects a year and clicks the Get different year button.
1. The page itself, in HTML, has our desired data by week, and each week is a separate table object, with a tr (row) for each of the variables and associated values for that week.
2. The form consists of two entries:
   - the year field maps to the four-digit string of the desired year, and
   - the newYear field maps to the constant value Get different year.
3. The HTTP method is POST, which means that the form entries should be in the body of the POST as URL-encoded field=value pairs, separated, per URL encoding, with an ampersand (&).
4. The resource path/URI of the POST is the same resource path as the original.
22.2.4.3 Request and Data Extraction
1. We must make a POST request instead of a GET request.
2. The request must include a body that consists of key-value pairs.
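A hedged sketch of such a request with the requests module follows; the year "2015" is an assumed example value:

```python
import requests
from lxml import etree

url = ("https://ww2.energy.ca.gov/almanac/transportation_data/"
       "gasoline/margins/index_cms.php")

# The two form fields identified in discovery
formdata = {"year": "2015", "newYear": "Get different year"}

# Passing a dict via data= makes requests URL-encode the body and set the
# Content-Type: application/x-www-form-urlencoded request header
response = requests.post(url, data=formdata)
root = etree.fromstring(response.text, parser=etree.HTMLParser())
```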
Note that the specification in the request headers, through Content-Type, indicates that the body of the post is a URL-encoded form.
Given that each table represents a single week and that the rows in the table represent variables, each table will give us a single row for a table representing the data of the page. With an eye toward collecting a list of dictionaries for construction of the table, we will develop processing of one table to result in one (row) dictionary.
We can see from the print of the tree that the first piece of data needed, the date, is in a caption child of the table. Let us postulate data columns:
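a hedged sketch, postulating a date column plus one column per variable row, could look like the following; the placement of the variable name in a th and the branded value in the first td are assumptions of this sketch:

```python
def process_table(table):
    """Build one row dictionary from one week's table Element."""
    rowdict = {}
    # The week's date is the text of the caption child of the table
    rowdict["date"] = table.xpath("./caption/text()")[0].strip()
    # Each subsequent tr pairs a variable name with its branded value
    for tr in table.xpath("./tr[position() > 1]"):
        name = tr.xpath("./th/text()")       # variable name (assumed in a th)
        branded = tr.xpath("./td[1]/text()") # branded value, per discovery
        if name and branded:
            rowdict[name[0].strip()] = branded[0].strip()
    return rowdict

# One dictionary per weekly table, accumulated for a pandas data frame
LoD = [process_table(t) for t in root.xpath("//table")]
```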
22.2.5 Reading Questions
Please think up a reason why there might be a lot of broken HTML on the web (i.e., malformed from an XML perspective).
Please justify the value of the print_xml() function to a working data scientist. What would life be like if we did not have this function?
The reading says “Before engaging in web scraping, a developer must look at the acceptable use policy of a provider and also look at their policy on automatic access to their pages.” Please find an example of an “acceptable use policy” and “policy on automatic access” and give your example here, including the link to it. In general, how can you find this kind of information?
Please study the California gasoline web page used in the POST example and explain what is meant by “Branded” versus “Unbranded” in real-world terms: https://ww2.energy.ca.gov/almanac/transportation_data/gasoline/margins/index_cms.php
In the first XPath code shown in the reading, explain why we use
instead of
With reference to the content printed by print_xml, please explain carefully why the XPath expression ./thead/tr/th/text() yields a list of column names.
In the ind2016 example, consider the line
Please describe tdlist carefully after this line is executed. For instance, how big is this list? What is the type of the items in tdlist? How does it relate to the table?
In the code for the LoL solution of extracting ind2016 data, what is the purpose of fieldcount?
In the DoL approach, what is the point of the format string in the following snippet of code?
The reading invoked state_column = table.xpath(xs) for the XPath expression below
xs = ".//tr[position() > 2 and position() < 54]/td[3]//a/text()"
Can you think of a way to rewrite this to avoid the use of the position function? You might consider consulting the reading to make sure you understand what this expression is trying to accomplish.
The reading shows two ways of building a pandas data frame from a web page: using either a LoL or DoL. Please write a sentence describing how each approach works, then answer: which do you find more intuitive and why?
In what way, specifically, does the XPath expression in the reading "./tr[position() > 1]/td[1]/text()" extract the branded rather than unbranded data?
Consider the XPath expression used to extract the list of states. Suppose you wanted to extract the list of links (one per state), e.g., for use in a web crawling program. What XPath expression would you use?
Instead of a list comprehension applying the lambda function to create pop_column, could we use map? Explain.
When describing the POST, the reading discusses the resource path given by action. Can you think of an example where a POST would want to send data to a different resource path than the page the user is currently on?
Explain the use of the XPath expression
by referencing the HTML structure of the web page in question.
22.2.6 Exercises
Write a function
that performs an HTTP GET for the specified resource at location, using protocol, which defaults to https. If params is specified, it should be used as a dictionary of query parameters for the request. If the GET is successful, the function should verify that the result is HTML. If that is true, parse the HTML and return the root of the tree. If the resource is not found, or if the resource is not HTML, or the resource could not be parsed, return None.
Write a function
that retrieves the HTML tree from resource at host location, using the specified protocol, and then finds all the external hyperlinks referenced in the document. Return a list of strings for the URLs. Recall that a hyperlink is referenced with the tag: a, and the href attribute within that tag contains the link itself, as opposed to the displayed text. Since many links in a document can be internal, we want to restrict to those that contain a URL, not a URI. Bonus points if the algorithm only returns the unique external links.
What is the first ordered or unordered list found in the tree? If you wanted that specific list, and wanted to protect against another list being added as the first one, what steps would you take to get the desired list?
What would the XPath be to get the div whose class attribute is "RichTextElement" that is a descendant of the div whose id attribute is "main-content"?
Given the (single) node that results from answering the previous question, write a single XPath expression that would get the set of strings naming the available databases.
Repeat the above, but use procedural steps to accumulate the list of databases.
In English, describe what you learned about the HTML structures involved in this collection of information. Is this the same as one of the structural options considered in the chapter?
Consider the page: https://datasystems.denison.edu/ind2016_list.html. On that page is list-structured data that consists, at the outer level, of an ordered list, and at the inner level, an unordered list of indicator values. It also has some adornment in font face, like bolding and italics. Using the techniques of this chapter, write the code to scrape this page and create a pandas data frame containing rows for each of the five countries, and columns for code, name, gdp, pop, and life.
Consider the page: http://datasystems.denison.edu/ind0.html, the page described in the book that contains triple-nested lists. Note also that all the lists are unordered. Write code to web scrape and create a pandas data frame that has three rows, one each for FRA, GBR, and USA, with a code column and four data columns, one for each of pop and gdp for the years 2007 and 2017. The best solution would use a two-level column index.
The last two exercises ask the reader to scrape tables on Wikipedia. We would ask that you adhere to the acceptable use policy and use the Wikipedia API to access the pages. We also caution that websites not under our control are constantly changing, and while, at the time of this writing, these exercises were feasible, changes could make adaptation necessary.
At the Wikipedia page called
is a table of novels considered great by one or more experts. Using the resource path prefixed with "/api/rest_v1/page/html/", obtain the HTML tree from that page and scrape the page, building a pandas data frame of the results. You are to design the table as you see fit for the columns to be included. Make sure your solution is robust to the case of additional novels being added to this list.
At the Wikipedia page called
is a table containing, by week, the top songs from the year 1960. Using the resource path prefixed with "/api/rest_v1/page/html/", obtain the HTML tree from that page and scrape the page, building a pandas data frame of the results. You are to design the table as you see fit for the columns to be included. This exercise is slightly more difficult than the previous one, because the table has entries that “span” more than one row, when the same song is top for multiple weeks.