Chapter 5: Power Query
Excel is incredibly powerful in the right hands. Experienced analysts have many tricks and techniques for massaging malformed data into useful data. As an experienced analyst, I (Bill) relish the fact that I can apply 13 crazy steps to make data useful. I love knowing that I can do something that most people using Excel could never do.
The team that developed Power Query wanted to level that field. Power Query is incredibly exciting. It is Microsoft’s downloadable data-shaping engine and add-in for Excel 2010 and 2013. And it is aimed at the Power Pivot “generation.” Power Query takes a whole class of “ugly Excel data” problems and gives people an intuitive interface for easily dealing with those problems.
Power Query was not developed by experienced analysts. If you had asked experienced analysts to build Power Query, they would have scoffed and said, “But we already can do all that stuff in Excel” and then rattled off the 13 arcane steps. Experienced analysts never would have given the world Power Query because we enjoy being able to do the impossible in Excel.
However, with Power Pivot, a whole bunch of SQL Server and SharePoint people started using Excel. They needed ways to deal with ugly data. And thus, Power Query was born.
As an experienced analyst, what is my reaction to Power Query? I love it. It turns out that I never really enjoyed doing the 13 arcane steps. I love that I can now replace those 13 steps with a few clicks.
Combining Multiple Worksheets or Workbooks into a Single Power Pivot Table
This trick doesn’t actually use Power Query, but it’s a really neat trick, and the next trick—which does use Power Query—builds on this one. Let’s say you have multiple worksheets (or workbooks) that all contain the same sort of data:
Figure 5.1
These worksheets all come to you separately, but you really want them to appear as one big table. So you need to combine all of them into a single Power Pivot table.
Naturally, if you’re working with a small number of sheets, and the sheets aren’t massive, you can just copy and paste them all into one table in Excel and then copy and paste them into Power Pivot, or you can link the table into Power Pivot or export it as a CSV file so you can import it elsewhere. One other technique would be to use Paste Append to directly paste into Power Pivot.
If the combined data set exceeds 1 million rows, you won’t be able to combine the sheets into one because you will exceed the worksheet row limit. And you can’t use Paste Append to paste a data set of that size into Power Pivot; pasting large data sets into Power Pivot takes forever, if it completes at all.
Here you’ll see how to tackle this problem.
Removing the Header Row
First, you need to delete the header row because you don’t want the header duplicated multiple times in the combined data set:
Figure 5.2
It’s also important to ensure that the data starts in cell A1:
Figure 5.3
Saving Each Sheet as a CSV File
When you save as CSV, Excel saves only the active sheet, so you have to repeat this for each sheet (or each workbook). This yields, for instance:
Figure 5.4
Now you have a bunch of individual CSV files.
Opening a Command Prompt
Next, you need to open a command prompt by selecting Start, All Programs, Accessories, Command Prompt:
Figure 5.5
A command prompt window appears:
Figure 5.6
Changing Directory to the Folder Containing Your CSV Files
Use these commands to change directory to the folder holding your CSV files:
Figure 5.7
Copying Them All into One File
Here’s the only real “trick” in this whole process. You enter copy *.csv combined.csv at the command prompt:
Figure 5.8
This command merges all your CSV files into a single one. This is what you then see in Windows Explorer, in the folder that contains your CSV files:
Figure 5.9
Importing the Combined CSV File into Power Pivot
Finally, you important the new combined CSV file into Power Pivot by opening Power Pivot and selecting Home, From Text:
Figure 5.10
Bonus: A Macro to Save All Workbooks in a Folder as CSV
If you have a bunch of workbooks in a folder and you need to save them all as CSV files, here’s a macro that will do it all for you:
Sub ProcessWorkbooksInFolder()
Dim sPath As String
Dim sFile As String
Dim sDir As String
Dim oWB As Workbook
Dim sNewPath As String
'***Set this to your folder of workbooks***
sPath = "C:\Users\rob\desktop\csvfiles"
If Right$(sPath, 1) <> Application.PathSeparator Then
sPath = sPath & Application.PathSeparator
End If
sDir = Dir$(sPath & "*.xlsx", vbNormal)
Do Until LenB(sDir) = 0
Set oWB = Workbooks.Open(sPath & sDir)
sNewPath = ActiveWorkbook.FullPath
sNewPath = Left(sNewPath, Len(sNewPath) - 4) & "csv"
ActiveWorkbook.SaveAs sNewPath, xlCSV
oWB.Close False
sDir = Dir$
Loop
End Sub
Note
If your goal is to combine several worksheets into a regular Pivot-cache PivotTable, check out Fazza’s amazing macro, available at
http://ppvt.pro/PIVOTMACRO2 .
Using Power Query to Merge CSV Files
To illustrate how exciting Power Query is, this trick shows you how to merge CSV files into one table by using Power Query rather than manually saving and merging files with the command prompt (as discussed earlier in this chapter). Just to make things a bit more difficult, let’s start out with three CSV files, with identical structures, that are stored on a website:
Figure 5.11
You don’t need to download the CSV files to your computer! You can leave the files where they are and build a completely automatic system—one that requires zero manual intervention on an ongoing basis.
Connecting to One of the CSV Files
Once Power Query is downloaded and installed, you get a Power Query ribbon tab, which has a From Web button on it (the Excel 2013 version is pictured here):
Figure 5.12
After you click this From Web button on the Power Query ribbon tab, enter the URL for one of the CSV files:
Figure 5.13
You now get a pop-up query window with its own ribbon:
Figure 5.14
Adding a Custom Column to “Tag” This File
Next, you need to insert a custom column and enter a static formula ( =1 ) to “tag” this file as containing the data for Territory 1:
Figure 5.15
Then you right-click the new custom column and rename it Territory:
Figure 5.16
Then you rename this query Terr1 and click Apply & Close:
Figure 5.17
Excel now shows the data like this:
Figure 5.18
This is not quite what you want. You want the data in Power Pivot, aka the data model. No sweat. Just edit the query you just wrote:
Figure 5.19
Next, you uncheck Load to Worksheet and check Load to Data Model, and then you click Apply & Close:
Figure 5.20
Now you see the data in the Power Pivot window instead of in Excel:
Figure 5.21
Connecting to the Second CSV File
To connect to the second CSV file, you use the same steps as for the first one, but with a different URL:
Figure 5.22
You also assign the custom column (Territory) a static value of 2 rather than 1 :
Figure 5.23
Connecting to the Third CSV File
As you might guess, connecting to the third CSV file is just like connecting to the first two, but with 3 instead of 2 or 1 in the custom column and query name.
After you connect to the third file, you have three queries defined in the workbook. The Workbook Queries pane now shows all three CSV queries:
Figure 5.24
Time for the Append!
Now, back on the Power Query ribbon, you click Append:
Figure 5.25
In the Append dialog, you specify Terr1 and Terr2:
Figure 5.26
The Append dialog lets you choose only two tables/queries, but that’s okay for now.
Next, Power Query asks you about privacy. Since these CSV files are on a 100% public website, you can select Public:
Figure 5.27
Append1 is a fine name for the new query:
Figure 5.28
But check out the formula bar. Can you modify that directly and add Terr3?
Yes, you absolutely can edit the formula yourself and create a three-table append—all in one step:
Figure 5.29
“Keeping” Only the Appended Query
You now have four tables in the Power Pivot window: the three original CSV queries, plus the Append result:
Figure 5.30
Although you have four tables, you really only need the one combined/appended table. Fixing the needless duplication is no problem. Just revisit the three original queries and uncheck both check boxes:
Figure 5.31
After you uncheck both boxes for all three original queries, click Apply & Close. Now you get just one table in Power Pivot—and, yes , it does contain all three tables combined:
Figure 5.32
Notice that there is now only one tab in Power Pivot. It has 7,387 rows, which is all three CSV files combined. And the Territories column, which did not exist in any of the CSV files, is present in the result, with all three different values for Territory.
Testing Refreshing
With only the final result table “landing” in Power Pivot, will Refresh know what to do? Test it to find out:
Figure 5.33
After you click Refresh (not even Refresh All), Power Query successfully reruns the three “child” queries, fetching all three CSV files and then appending them together:
Figure 5.34
Why Is This So Amazing?
This trick in Power Query makes for some very happy Excel pros for a few reasons:
Note
The command prompt solution from the beginning of this chapter avoids the multiple-formulas problem but introduces a manual refresh step, which may arguably be even worse.
Using Power Query to “Unpivot” a Table
Unpivoting a table is a very common “reshaping” task. Let’s say you have a table of data that looks like this:
Figure 5.35
When you have date-related columns in your source data, you really should turn those into a single date column. By doing so, you turn a wide and short table into a narrow and tall table.
Date columns like this in Power Pivot are very inconvenient. They force you to write measures for each month (for instance, Jan Sales , Feb Sales , etc.) and make it impossible to perform time-series analysis using things like DATEADD() , DATESYTD() , and the greatest formula in the world (all of which are covered in DAX Formulas for Power Pivot ; also see
http://www.powerpivotpro.com/2011/12/the-greatest-formula-in-the-world-part-one/ for more on the greatest formula in the world).
So how do you reshape this table into something that is much more Power Pivot friendly? Again, Power Query to the rescue:
Figure 5.36
This time, the source data lives in an XLSX file. No problem. Once you have browsed to the source XLSX, you are presented with a list of all the tables available in that workbook (in this case, it’s just the Sales worksheet).
Handling the Header Row
Select the Sales table in the navigator, click Edit, and you get this:
Figure 5.37
Initially, Power Query treats the first row of the table as data rather than as a header row. No problem: Just click the Use First Row as Headers button to take care of the header row issue once and for all.
Unpivoting
Next, you need to select all the columns that you want to “transform” into a single column. You do this by clicking the leftmost column header that you want to combine (Jan in this case):
Figure 5.38
Note
Note that the Power Query team is rapidly improving their product, and they release new versions all the time. By the time you read this, it is VERY likely that the Power Query UI will have changed significantly, so the images here may no longer precisely match the product. The “heart” of the techniques, however, remains the same.
Then scroll to the rightmost column that you want to combine and Shift+click it:
Figure 5.39
Now, with those columns selected, click Unpivot on the ribbon:
Figure 5.40
This gives you almost what you want:
Figure 5.41
You now have an Attribute column that is the desired month column, but you also still have some blank columns, which you need to remove. (You also could have removed them prior to unpivoting.)
After you delete those needless columns and right-click and rename the Attribute column Month, you see this:
Figure 5.42
Bingo. Next, you check Load to Data Model and click Apply & Filter to get this:
Figure 5.43
You now have the right shape, with the right column, and 576 rows. (The original wide data set had only 48 rows but far more columns.) When you click Refresh, it all just works.
Why This Is So Amazing
Not only is unpivoting good for formulas—it means you can write a single sales measure instead of 12, and time intelligence calculations are possible—but it is also good for file size and speed. One of the topics covered in the “Performance” chapter of DAX Formulas for Power Pivot is precisely this unpivot operation.
Of course, at the time of that book’s first edition, we did not have Power Query. Things are better now. Now you can set up this unpivot query one time, and every time you click Refresh (or every time a scheduled SharePoint refresh runs), the right thing just happens.
Note
It’s kinda funny, isn’t it? In a world of Pivots, one of the most useful tools is Unpivot. And you use it to feed data into Power Pivot.
Using Power Query to Create a Lookup Table from a Data Table
DAX Formulas for Power Pivot covers the many benefits of using separate data and lookup tables. Performance, file size, and the ability to analyze multiple data tables in the same Pivot (such as Budget and Actuals) all are greatly improved by separating your lookup and data tables.
If lookup tables are a good thing, what do you do when you don’t actually have a lookup table, and all you have is a data table? For example, is this Bookings table a data table, or is it a data table with a lookup table hiding inside it?
Figure 5.44
Each row in this table captures three essential pieces of information: who, when, and how much. A single column (CustomerKey, Date, and Amount, respectively) is responsible for capturing each piece of information. This is the kind of information that data tables are meant to capture. So far, so good.
But the table also contains additional information about each customer: full name, phone, and address. This is overkill; there is no need for each of the first three rows, for instance, to tell us that customer 11000 lives at 3761 N. 14th St. If we know it’s customer 11000, we could look up that customer’s address in a separate table! This sort of information is better captured in lookup tables, and ideally, you would “outsource” such columns (FullName, Phone, and Address) to a lookup table.
Ways to Create Lookup Tables
There are several ways to build a lookup table. You could use a flat Pivot to create a four-column (Pivot) table in Excel with just CustomerKey plus the three columns in question, and then you could copy/paste or link (or save as CSV) that table back into Power Pivot to create a new Customers (lookup) table. But that would require manual effort each time you wanted to refresh and import the latest Bookings table data. New customers are likely to be appearing in the data all the time, and with this method, the manually created Lookup table will constantly fall out of date.
You could instead build a lookup table in the original source database (or ask nicely for someone else to do that for you). This is a fabulous solution—when it’s actually an option. But you don’t always have access to a database (and/or the skills to operate one).
You now have a really good third solution to this problem. As you can probably guess, you can use Power Query to create a lookup table. Because the Bookings table lives in Access, that’s the Power Query button you need to click:
Figure 5.45
Then you remove the columns that you do not want in the new Customers table. Date and Amount are not relevant in a table that describes each customer, so you Ctrl+click those two columns to select them and then right-click and choose Remove Columns:
Figure 5.46
Now, to get a single row for each customer key, you just select the CustomerKey column and click Remove Duplicates:
Figure 5.47
Finally, you rename this query Customers and select Load to Data Model:
Figure 5.48
Working with the Results
Here’s the freshly minted Customers table in Power Pivot:
Figure 5.49
It’s sorted by CustomerKey. Each key appears just once, but you still have properties columns including FullName, Address, and Phone. Perfect!
Now you can import the original data table (Bookings). This time, the only change you will make to the “raw” Bookings table is to remove the excess customer property columns (FullName, Phone, and Address) by right-clicking them and selecting Delete Column:
Figure 5.50
You then load the Bookings table into Power Pivot and relate it to Customers via CustomerKey:
Figure 5.51
With just a few simple steps, you have a refreshable, self-maintaining lookup/data table pair. You could even call this an elegant solution to a very common problem.
Creating a Calendar Table: Advanced Usage of Power Query
Power Pivot needs a calendar table like I (Rob) need coffee: When deprived of this essential ingredient, both of us are still useful to an extent but fall short of our potential…and sometimes return incorrect or confusing answers.
<Pause for a sip of coffee.>
Of course, creating a calendar table involves all the same problems as creating a lookup table; after all, a calendar table is merely a special kind of lookup table! If you manually create a calendar table, you have to update it frequently to include more recent dates (and sometimes drop older dates from the past) or you have to be okay with a calendar table that extends “into the future”—which can pollute your slicers and Pivots with needless date values and make some of your calculations do funny things in the most recent (incomplete) month.
Trust me on this: It’s less hassle to have a calendar table that is in lockstep with your data table(s)—one that starts on the first date for which you have data and ends on the last date for which you have data. To create such a “lockstep” calendar, you have the same choices as above: manual (repeatedly), database, or Power Query.
“Wait, I Don’t See a ‘Make Calendar’ Button!”
Yes, as of version 2.9, Power Query does not offer a nice one-click solution to this problem. What it does have, however, is something called M . M is not a character from a James Bond story but a data transformation formula language. So it’s even more exciting than a secret agent.
The bad news: M is brand new and uses different syntax than Excel and DAX. The good news: If you’ve been following the Power Query examples above, you’ve already been writing M:
Figure 5.52
Consider the earlier unpivot example: The language you see in the formula bar is M, and it was generated by via ribbon button clicks. A query can contain multiple formula steps, and each step can reference previous steps by name.
So the ribbon buttons on the Power Query ribbon tab are similar to the Macro Recorder for VBA macros: They help you write M without knowing M, but then you can edit the M by hand if you want.
Note
We won’t lie to you. We did not know how to write the M formulas to generate a lockstep calendar table. But we did know who to ask at Microsoft, and they happily sent us the M we needed.
Keeping It Brief
We’d consume about 10 pages here if we shared every screenshot involved in using M, but that would just make this look harder than it is. And you can always grab the workbook and inspect it yourself. So instead, we’ll just show you what it looks like when it’s done:
Figure 5.53
This lockstep calendar table is generated from the Bookings data table, using a multistep query in Power Query. Here are the M formulas for each of those steps:
ChangedType = Table.TransformColumnTypes(_ Bookings ,{{" Date ", type date}}),
MaxDate = Record.Field(Table.Max(ChangedType, " Date "),"Date"),
MinDate = Record.Field(Table.Min(ChangedType, " Date "),"Date"),
DaysElapsed = Number.From(MaxDate-MinDate),
DatesList = List.Dates(MinDate, DaysElapsed+1,Duration.From(1)),
RawDatesTable = Table.FromList(DatesList, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
ChangedType1 = Table.TransformColumnTypes(RawDatesTable,{{"Column1", type date}}),
RenamedColumns = Table.RenameColumns(ChangedType1,{{"Column1", "Date"}}),
DuplicatedColumn = Table.DuplicateColumn(RenamedColumns, "Date", "Copy of Date"),
DuplicatedColumn1 = Table.DuplicateColumn(DuplicatedColumn, "Date", "Copy of Date.1"),
DuplicatedColumn2 = Table.DuplicateColumn(DuplicatedColumn1, "Date", "Copy of Date.2"),
TransformedColumn = Table.TransformColumns(DuplicatedColumn2,{{"Copy of Date", Date.Day}, {"Copy of Date.1", Date.Month}, {"Copy of Date.2", Date.Year}}),
RenamedColumns1 = Table.RenameColumns(TransformedColumn,{{"Copy of Date", "DayOfMonth"}, {"Copy of Date.1", "MonthOfYear"}, {"Copy of Date.2", "Year"}})
These formulas all assume that you have an original table named Bookings, and it contains a column named Date. (The three places where these names appear are formatted in italics in the formulas above, so that you can see where to change them in order to match your own data set.)
How to Insert New Formulas
If you’re not using ribbon buttons, it’s a little difficult to discover how to manually add M formula steps. But it’s a simple matter of clicking the fx button:
Figure 5.54
There’s also another method. The intrepid among you may decide to use the Advanced Editor:
Figure 5.55
This method can be easier for bulk copying and pasting of formula steps.
The Results
The “punchline” of all this is that you get a calendar table, in Power Pivot, that starts on the oldest date from the data table and continues up to (and including) the most recent date in the data table. Each date appears only once, and no dates are skipped. These are all important qualities of a calendar table. And, of course, everything updates automatically and properly every time you click Refresh (or every time a scheduled refresh runs on SharePoint).
Here you can see that Power Query used the original Bookings table (27,617 rows) to produce a calendar table (1,127 rows) with one instance of each date and no gaps:
Figure 5.56
You also use Duplicate Column and some transforms in Power Query to create a few of the other columns:
Figure 5.57
This figure shows how you add the DayOfMonth column in Power Query:
1. Select the Date column.
2. Click Duplicate Column.
3. Select the new (duplicated) Date column and click the Day transform.
Then, although it’s not shown in the figure, you rename the duplicated-then-transformed column to DayOfMonth.
Removing Duplicates
In this trick, could you use Remove Duplicates instead? Maybe, but doing so would be risky. Remember that calendar tables need to be “unbroken”—that is, no gaps (missing dates). Even if your business is never open on weekends, you still need the weekend dates in there. If you use Remove Duplicates on the Date column in your data table, and there are some dates missing, you will get a poorly formed calendar table as a result.
Maybe you know that you never have such gaps in your Data table, and as a result, you are comfortable using Remove Duplicates. This is absolutely fine. We just wanted to make sure you understand the risks and provide a safer alternative.