MDataMonkeyeBookNoIconsForEPub-14

Chapter 12 Importing Web Data

One of the really interesting use cases for Power Query is when leveraging it to pull data relevant to your business from the web.

Power Query works very well when website data is stored in tables, and on occasion it even lets you directly access a text file that is holding information. If the data isn’t formatted with table tags in the HTML code, however, things become very difficult, even if you have some HTML knowledge.

Connecting to Pages with Tables

For the example in this chapter, you will take the role of a marketing person who is trying to figure out the best countries in which to market a product. The primary statistic that you want to figure out up front is which 10 countries have the largest populations, as that seems to be a factor that could be reasonably construed to drive sales.

In order to get this information, you have been exploring the CIA World Factbook website, and you have come across a web page that lists the populations of the countries of the world. You would like to pull this data into Excel so that you can use it with other tables. Follow these steps:

Create a new query → From Other Sources → From Web

You are now prompted with a set of authentication options:

Figure 148 The web authentication dialog.

There are several options down the left side of the dialog, allowing you to choose from Windows or basic authentication, provide a Web API key, or even provide your Office 365 credentials. The important one to you at this point is the Anonymous option.

The question you are being asked here is if you’d like to use these credentials for the specific page or the root domain. While there are sometimes reasons to choose the specific page, chances are that you’re probably going to want to stick with the default—the root domain—in most cases. So ensure that the first option is selected and click Connect.

At this point, Power Query inspects the document, looking for tables. Once it has worked out the content it recognizes, it sends you to the Navigator pane. In this case, there are two options to choose from—Table 0 or Document:

Figure 149 Previewing Table 0 in the Navigator.

In this case, Power Query recognizes that the HTML document contains a table denoted by table tags in the page code. It presents that table for you in an easy-to-use format.

If you select the Document option, you see that it doesn’t look nearly as pretty:

Figure 150 No good can come from this.

Note: The reality here is that it is going to be much more difficult to extract your data via the document interface than via a table.

Fortunately, the data looked pretty good in the preview of the Table 0 table. It was already organized in a beautiful table format, with headers in place. It was even already sorted into ascending order, just as it was in the web page. Your real goal is to filter to get the top 10 countries, so follow these steps:

Select Table 0 → Edit
Rename the query Top 10 Population
Go to Home → Keep Rows → Keep Top Rows → 10 → OK

This is what you have now:

Figure 151 You have your top 10 countries.

The query is now finished, so you can finalize it by selecting Home → Close & Load.

Connecting to Web-Hosted Files

As it happens, the World Factbook page actually makes use of a text file to feed the data displayed in the web page.

Now, if you happen to know the URL to the text file, can you connect directly to it? The answer is yes, and here’s how it works:

Create a new query → From Other Sources → From Web

Enter the URL as follows: https://www.cia.gov/library/publications/the-world-factbook/rankorder/rawdata_2119.txt
Click OK

You are launched directly into the Power Query editor.

Note: Notice this time that you bypass the website authentication step. The reason for this is that you specified that you wanted the authentication settings to apply to the website root (www.cia.gov) when you created the previous query. Had you not done that earlier, you would have been prompted to provide the authentication method.

Figure 152 Connecting directly to a text file over the web.

Based on the data you get back, you can assume that the column headers are not stored in the text file but rather were provided by the HTML of the web page. This is no big deal though, as you can fix this to make your data consistent with the previous experience:

Go to Home → Keep Rows → Keep Top Rows → 10 → OK
Right-click Column1 → Rename → Rank
Right-click Column2 → Rename → Country
Right-click Column3 → Rename → Population
Rename the query From Text

It appears that the estimate date was also provided by the HTML page, not the text file, but since that’s not a huge deal to you, you’re not worried about it.

The truly important thing here is that you can connect directly to a file hosted on the web without going through the whole web page interface . . . if you know where it is.

Connecting to Pages Without Tables

If the web page you’re connecting to doesn’t contain table tags in the HTML source code, you’re left with a horrendous experience of trying to drill into the HTML elements. This experience is about as much fun as navigating a subterranean labyrinth using a candle for light, where every signpost simply says “This way out.”

The best way to get help in this situation is to open your web browser, turn on the developer tools, and try to find the element you want to extract. The following is the element inspector in Chrome (which you open by pressing F12) for the CIA web page:

Figure 153 HTML hell.

The trick to this is to first expand the <html class tag at the top. You then mouse over every arrow and watch what it highlights in the main document window. When the highlighting covers your table, you expand that element by clicking on the arrow that points right. (In this case, <body> needs to be expanded.)

The arrow then rotates down, exposing more elements, and you continue the process until you find your data. Next, you expand <div class=”main-block”> and then <section id=”main”> and then the second div class, and so on.

If you accidentally travel into a hole that doesn’t have your data, you go back up one level and collapse that block of code by clicking the arrow. This rotates the arrow back from pointing down to pointing right, and it collapses the elements contained within that tag.

Once you have navigated through the process and found your data, you can begin the painful second portion: replicating the navigation in Power Query. Here’s what you do:

Create a new query → From Other Sources → From Web

Enter the URL you used for the first example: https://www.cia.gov/library/publications/the-world-factbook/rankorder/2119rank.html
Click OK → Document → Edit

You’re now looking at this rather unfriendly view in the Power Query editor:

Figure 154 A most uninspiring view.

Now you need to very carefully replicate the steps you took in the web developer interface, drilling into Power Query’s corresponding table element. There are some parallels between the two programs to help, but even so, it is easy to get lost.

The trick to navigating this process is to recognize that the Name field in Power Query contains the element shown in the web developer tools. In this case, you have HTML, and in Chrome you saw <html class at the top. These two items are the same.

Click on Table in the Children column to drill into it:

Figure 155 Children of the HTML element.

You now see the HEAD and BODY tags. Based on the HTML you expanded, you know you need to drill into the Body tag. You click the Table there and keep going.

The killer with this process is that in the HTML, the tags all have names, but in Power Query you don’t see them, so it’s very easy to get lost. In addition, the Applied Steps box doesn’t trace the route; it just keeps combining steps together, giving you no way to back up one level. When that happens, your only recourse is to start over again from the beginning.

And as if that weren’t bad enough, at the end of this navigation process, you end up extracting columns to drill into the detail, and it ends up stacked in a vertical table:

Figure 156 So much for a nice clean table!

The steps to make this table into a nice clean one are beyond the scope of this chapter, so we are going to abandon this approach at this point. (Rest assured, the steps are covered, just not until Chapter 15.)

The steps to complete this process have, however, been saved in the completed example, which can be found at Ch12 Examples\Importing Web Data – Complete.xlsx. This particular query has been saved as TheHardWay. Even with that query to review, you’ll need to recognize that the Navigation step was generated as documented below.

Starting from the initial table:

Figure 157 Starting the journey into HTML Hell.

Drill into Table in the Children column for:

HTML (row 1)
Body (row 3)
DIV (row 4)
The 1st DIV (row 4)
SECTION (row 2)
The 2nd DIV (row 4)
The 2nd DIV (row 4)
ARTICLE (row 4)
DIV (row 4)
The 5th DIV (row 12)
TABLE (row 2)
TBODY (row 2)

If you follow this drill-down carefully, you’ll see that you drilled into exactly the same place as displayed in the Navigation step of the TheHardWay query, and you can follow the rest of the steps through to the end.

The job of drilling into the HTML document can be done, which is better than the alternative. However, it is not for the faint of heart and can be an incredibly frustrating process.

Caveats and Frustrations with the From Web Experience

The From Web experience is certainly a weakness in Power Query’s otherwise incredible arsenal of tools. There are several things that we hope could be improved in this area, and there are things to watch out for as you develop solutions based on web data.

None of the factors discussed below should be seen as reasons not to develop solutions based on website data. Instead, they are intended to make sure you go into this area with both eyes open, seeing not only the benefits but also the risks of relying on web-sourced data you don’t control.

The Collecting Data Experience

Building solutions against web data can be a very painful experience in Power Query. If there are table tags in the HTML code, everything works well. If not, however, all bets are off.

Compare this to Excel’s classic Data → From Web experience, which kicks off a web browser that allows you to navigate the web to find the data you’re looking for. Power Query does not give you this facility but instead leaves you to navigate web pages using a different browser.

In itself, this may not seem like a big deal, except for the issue that it is very difficult to tell if the data you are seeing in the web browser is in an easy-to-use table or not, without looking at the code.

The challenge here is that you’ll grab the promising URL, drop it in the Power Query interface, and then spend a huge amount of time expanding columns, chasing each route down the rabbit hole of HTML tags, trying to find the data you’re looking for. Even if you do understand the web debugging tools, it still doesn’t make the job easy. Because Excel pros are not web developers, this is really an unnecessary burden that is being placed upon them.

Ideally, this experience should kick off a web browser, let you navigate to the page and select the data range you’d like to import, and then do the leg work of breaking the HTML down to do that for you. Until that happens, however, importing data that is not formatted in a tabular fashion will continue to be difficult.

Data Integrity

Another major concern with web data is the source and integrity of the data. Be cautious of connecting and importing data from sites such as Wikipedia or other sites that you don’t have a business relationship with.

While demos love to use Wikipedia as a great example, the reality is that relying on this site can be dangerous. The content is curated, but it can also be changed by users. Although the site makes a great effort to curate data, the information there is far from perfect and may not be entirely factual.

The other issue is how readily the data is updated. Imagine investing time in building a complicated query against a web page, only to find out that the owner/curator doesn’t update it on a timely basis. You need to be assured that when you refresh the data, the routine isn’t merely refreshing out-of-date data but rather that it’s refreshing current data. You’re likely to invest a significant amount of time and make business decisions assuming that the last refresh you did pulled the most recent data.

Solution Stability

There i s another very real concern when building your business logic against web sources that you do not own or control. Like your company, every other company that provides these feeds and pages wants to better serve its customers. Unfortunately for us, that doesn’t mean that they are interested in putting out a consistent experience that never changes. In fact, quite the opposite is true. They are also trying to update things, changing the web pages to add new bells or whistles, or make the sites more attractive to the human eye. This has the very real side effect of blowing queries apart, usually without notice, and oftentimes when you don’t have time to fix them.