Chapter 5 Aggregating Excel Worksheets

So far, each of the techniques demonstrated for appending data has been based on pulling data from external CSV or TXT files. As any Excel pro knows, however, the world of data is much bigger than just these types of files. There is a huge amount of data stored inside Excel files as well, and you need the ability to combine those data sources together. For example, you might want to be able to append all tables in a workbook. Or you might want to be able to append all worksheets in a workbook (where tables haven’t been set up). And, naturally, you need to be able to accomplish these goals from within the active workbook or from an external one.

While the good news is that these feats can all be accomplished, the bad news is that there is no easy way to accomplish these tasks via the standard user interface. Each of them requires some minor manipulation of the Power Query formula bar—although nothing more serious than what you’ve seen so far.

It’s important to understand here that the methods for dealing with data in your active workbook are quite different from those for pulling data from an external file.

Aggregating Tables and Ranges in the Current File

Pulling data from within the current file works like this:

This process gives you access to the following objects:

Unfortunately, this process doesn’t allow you to enumerate all worksheets in the file.

Consolidating Tables

Let’s look at some specific examples, starting with the Ch05 Example Files\Consolidate Tables-Start.xlsx.

In this file you’ll find three worksheets that list gift certificates issued by a spa for the current month. The clerk who created the workbook never added any issue dates to the certificates, although she did dutifully create table names in the format Jan_2008, Feb_2008, and so on.

In order to analyze these transactions, the data obviously needs to be consolidated, and Power Query is the perfect tool for doing that. Let’s get started:

You now see your list of tables:

Figure 61 The list of tables in the current workbook.

As you learned in Chapter 4, you can click in the whitespace beside the dark green words in the Content column to preview the data. You also know that if you click the double-headed arrow at the top of that column, you’ll be able to expand the columns of the tables without losing the Name column’s details. Here’s what you do:

The data expands nicely, keeping the name column in place, allowing you to continue on and convert the table names into dates:

The completed query now looks as follows:

Figure 62 Completed query, ready to go.

Everything looks good here, but when you choose Close & Load, you trigger a bunch of errors:

Figure 63 It looked so good! How can there be 62 errors?

So what happened? Go back and step through the query by right-clicking the Gift Certificate query → Edit → Select the Source step. At this point, you’ll notice that you’ve got one more table listed than you did earlier. It’s the Gift Certificates table that was created as an output from this query!

Figure 64 The finalized query appears in the list of tables.

Warning: When using =Excel.CurrentWorkbook() to enumerate tables or ranges, remember that the output query will also be recognized upon refresh. To deal with this, some future-proofing steps may be required, depending on how you built your query. These steps could include filtering to avoid potential duplication or errors, changing data types, or changing anything else that you set up in advance to guard against the possibilities of your query breaking.

You should now step through each step of the query, paying attention to what happens.

When you get to the Replaced Value step, scroll down to row 63. Do you notice anything dangerous unfolding here?

Figure 65 Problems arising in the values to be converted to dates.

When the Changed Types step is evaluated next, it tries to convert all data in the Name column to dates, which it obviously can’t do for Gift 1, Certificates. Instead, it places an error value in the column for each cell that contained that text:

Figure 66 Invalid dates converted into errors.

This issue actually works in your favor, as everything from the Gift Certificates table is a duplication of the data. With those rows throwing errors, you can simply filter them out:

Figure 67 There are 62 rows loaded from 3 tables.

With the filtering done, you get a positive result from Power Query loading only 62 rows of data—with no errors.

Consolidating Ranges and Worksheets

Now what if the worksheets didn’t have tables, but the clerk named the worksheets instead? In that case, could you consolidate all the worksheets? Yes, but as mentioned before, there is no built-in facility to do this. Instead, you have to make use of the ability to talk to named ranges by calling a specific named range.

The trick is to define a print area. Why? Because the print area’s name is, in fact, a dynamic range name. So to pull a print range into Power Query, you need to follow the steps to connect to a dynamic named range, as explained in Chapter 1:

You now see a list of all the tables and named ranges, including the print areas!

Figure 68 Excel.CurrentWorkbook() showing the print areas.

You currently have both tables and the print areas, but you can filter this down and expand it to see what you get:

Notice that things are different here. You have the entire worksheet, including each column in your print range:

Figure 69 A raw look at the worksheet.

This obviously means that more data cleanup needs to be done in order to aggregate these ranges and turn them into clean tables, but the good news is that it can be done. To clean up this particular data set, follow these steps:

When working with print areas, it is a good idea to try to restrict a print area to only the rows and columns you need—for two reasons. The first reason is that it takes Power Query longer to process more data points. The second reason is that each column is just referred to as Column# in the expansion window, which makes it very easy to pull in extraneous columns that just need to be removed later.

Aggregating Data from Other Workbooks

Say that you are building a solution that relies on data stored in other Excel workbooks. In this case, the user is storing data in worksheets named with the month and year, but only one quarter per workbook. Those workbooks are stored in a folder on the network, and you need to consolidate the data within them.

The methods used earlier in this chapter have been targeted at the objects inside the current workbook, and they don’t work when you’re talking to external workbooks. Instead, you actually need to generate a list of Excel workbooks and extract their contents, similar to what you did in Chapter 4 when you extracted the contents of CSV files.

Sourcing the Data

You start the data sourcing process from a blank workbook:

When Power Query lists the files, you can see that you already have several files in here that are non-Excel files:

Figure 70 Non-Excel files mixed in with the files you actually want.

You need to filter those files out of the list, as you only want Excel workbooks:

At this point you might be tempted to click the Combine Binaries button, and unfortunately Power Query will let you do it. If you do, though, you’ll find that Power Query combines the files but can’t import them:

Figure 71 Power Query says it can’t combine Excel binaries.

Note: If you actually clicked the Combine Binaries button, you now need to go to the Applied Steps box and remove both the Imported Excel and Combined Binaries steps.

So if you can’t combine and import the files using the easy method, you have to do it the hard way:

Your new custom column now contains a column of tables, which contain each of the objects you can connect to—including all Excel tables, named ranges, and even worksheets in the Excel file:

Figure 72 Previewing the parts of the Named Range.xlsx file.

Expanding the Excel File Contents

The Custom column has the double-headed arrow (the expand icon) in its header, so it can be expanded. When you click on the expand icon on the Custom column, you get a list of all the different workbook parts:

Figure 73 All workbook parts expanded.

At this point, it is critical that you recognize what you have here. The Kind column shows that you have Sheet, DefinedName, and Table objects. Remember, however, that you can define names over specific cells in a table and/or worksheet and that tables exist in worksheets. It is very easy to duplicate data in your import if you don’t filter this area appropriately.

Warning: Importing objects of different types as listed in the Kind column could lead to duplication of data!

You want to extract the worksheet data only, so follow these steps:

The query now looks as follows:

Figure 74 Excel data imported from an External workbook.

Transforming the Data

Obviously, the data in this example still needs some cleanup. You should do that now and finalize the query:

The data is loaded into a table, so you can create a PivotTable configured as follows in order to see what you pulled in. Use these settings:

Here you can see that you’ve successfully pulled data from two different files and three worksheets within each, for a total of more than 12,000 records:

Figure 75 The results of your hard work.

Considerations

When working through the techniques described in this chapter, there are a couple things you need to keep in mind.

Using =Excel.CurrentWorkbook()

The biggest thing to remember about building solutions using the Excel.CurrentWorkbook() function is that this function reads all the objects in the current file. Since this affects the calculation chain, you get hit with a recursion effect, meaning that as the new tables are built, Power Query recognizes them and reads them as potential content as well.

The implications of this appear at refresh, when the query attempts to load itself, thereby duplicating the data in the output. When working with this method, it is important to remember this and guard against it. Strategies to protect against problems here range from filtering errors on key columns to using naming standards for both your input and output columns so you can filter out the ones you don’t need.

Note: Whatever method you choose, make sure to test it through a couple refreshes before releasing it to production!

Using =Excel.Workbook([Content])

Unlike with Excel.CurrentWorkbook(), using the Excel.Workbook() function doesn’t cause any recursion issues. The reason is that Excel.Workbook() reads from the most recently saved copy of an external Excel workbook. This can obviously have implications if a user is in the process of updating the file but hasn’t saved it, as you won’t get the most current data that you might expect.

The other key concern when using the Excel.Workbook() function to extract the contents of a workbook is that it pulls worksheets in addition to ranges and tables. This makes it very easy to duplicate data if you are not careful, as all named ranges and tables exist in worksheets. Pay careful attention to filtering the Kind column to avoid this issue.

Note: Even if there is only one kind of data in the workbook when you build the solution, it’s not a bad idea to future-proof your solution by forcing a filter to accept only the desired kind. This will prevent issues when a user adds a table to a data file that never had one.

It should also be recognized that it is entirely possible to use Excel.Workbook() to read from the current workbook, thereby exposing the ability to read worksheet content. (This is useful if you can’t define print ranges or other ranges over your data.) Remember, however, that the query will read from the most recently saved copy of the workbook, not the live copy. This has the unfortunate side effect of still facing the recursion issues from the Excel.CurrentWorkbook() function, but you may not see them until the next time you open the file.