Chapter 6 Using Power Query to Quickly Clean Up Data
Power Query is a new add-in for Excel 2010 (and later) that is, quite simply, going to revolutionise your data clean-up tasks. There’s not room in this book to cover Power Query in great depth, but this chapter describes how to use it for three frequent data clean-up tasks.
Note A lot of the inspiration I got for this chapter came from attending Ken Puls’s talk at the Excel MVP conference in Amsterdam in April 2015.
As you’ve heard throughout this book, if you want to do any sort of data analysis, your data needs to be clean. That is, it can have no blank rows and no blank columns, and it needs to be organised in a table. As you’ve heard so many times in this book, the data needs to be normalized, with one row per transaction and all details completed. This is the Holy Grail state for data.
However, most data downloads (particularly from accounting packages) are not organised like this. They usually have subtotals, extra columns, extra rows, and the double underlines that accountants use to say “THIS is the total.” When you get such a download, you may need to spend many hours deleting columns, removing totals, and removing rows—and then you may need to repeat that experience every month (usually at month end, when you’re sure to have loads and loads of time).
You’ve learned a bunch of techniques for getting to the Holy Grail state with your data. Alas, it usually takes a lot of work. But learning to use Power Query can help you eliminate most of that work. Power Query lets you do that work just once, save the recipe for your secret sauce, and then call up the recipe again the following month or week and throw in the appropriate data. Power Query does the work for you, getting the data at least most of the way to the Holy Grail state.
Note You do not, by default, have Power Query installed. This chapter does not cover how to install Power Query, but you can easily find Power Query and instructions for installing it at the Microsoft site—or you can ask your IT people to install it for you. Don’t forget to emphasise that it is free; the IT department generally likes free. The version you use depends on your machine and your version of Office.
Power Query allows you to pull in data from numerous data sources, including Facebook. It also allows you to easily import text/CSV data from a folder—yes, a folder!
This chapter focuses on three data clean-up tasks that people often need help with:
Cleaning Up an Accounting Data Dump
In this section you are going to learn how to clean up a typical accounting file data dump. I have taken a data dump from a QuickBooks sample company file. This dump is very typical of what an accounting package produces when you click the ever-present Send to Excel button.
If you open the file 06_Accounts_data_dump_01, you’ll notice straight away that it has a lot of extra blank rows and columns, and you’ll see that totals appear here and there. It’s definitely not pivot table ready. But it is a very typical accounting file that gives a list of sales of products. Unlike the typical accounting files you’ll see, though, this list has already been converted to a table.
Note If you plan to use the Power Query work you do here as the basis for future clean-ups, it’s important that you convert the entire list you’re cleaning up to a table to ensure that Power Query will pick up all the data, even if the new data has more rows than previous months.
Of course, one of the things you want to check is that the data is still the same when you are finished with your clean-up. So you need to take a look at the total before you begin and again at the end. With this example, before you begin the clean-up process, the total amount is 65,010.65 (cell U130) and the total quantity is 128 (cell Q130).
Survival Tip You can quickly navigate to a cell by pressing F5 and typing the cell number in the Reference box. For example, you can press F5 and type U130 in the Reference box to jump to cell U130. |
Okay, you have seen what you are facing. You can put it down slowly, ma’am, and step away from the spreadsheet. Close it. (And if you are prompted about saving the changes, close it without saving.)
Creating a Query in Power Query to Clean Up Accounting Data
In Power Query terms, a query is a series of steps you create to pull data into Power Query and perform a number of actions on the data. In the background, Power Query records those steps (using a language called M, which has nothing to do with the M of James Bond, although it is pretty powerful). A query is somewhat similar to a macro in that you record steps and can later play them back. Once you’ve recorded a query in Power Query, all you need to do to perform the steps again is change the data source you apply the query to.
Instead of doing all the clean-up the old-fashioned way, you can use Power Query to clean it up and make it pivot table ready (with headings, no blank rows, no blank columns, no extra total rows,…).
1. Open a new file in Excel.
2. Select Power Query | From File | From Excel (see Figure 6-1).
Figure 6-1
3. Navigate to where you have stored 06_Accounts_data_dump_01 and click it. Click OK. Note that this file has already been converted to a table.
4. In the Navigator window that appears, click the Accounts sheet on the left and then click the Load button (see Figure 6-2). You are nearly there.
Figure 6-2
5. Click Query | Edit Query (see Figure 6-3). In the Power Query window that appears, on the right is the Applied Steps window, which lists the steps you take to clean up the data (see Figure 6-4).
Figure 6-3
Figure 6-4
The Applied Steps window essentially provides a record of what you do with the data. Although you can’t “undo,” you can remove a step by clicking the x to the left of the step (see Figure 6-5). However, if you click on the x next to steps that are taken toward the beginning of a query, you may sometimes receive a warning.
Figure 6-5
Note Once you have your data in Power Query, it’s important to remember that most of the operations you need are on the Home tab.
In Figure 6-5, you can see that some of the columns already have names because this data has been converted to a table. You can also see that some of the cells contain the word null. Power Query inserts null wherever there was a blank cell in the original data.
You need to remove all the columns that contain nothing but null. You can click the filter at the top of a column to see what is in the column. When you do this, you find that the following columns have no data (i.e., contain nulls only): Column 1, Column 5, Column 6, Column 7, Column 8, Column 9, Column 10, Column 11, Column 12, Column 13, and Column 14. The following steps walk you through what you need to do to clean up your data:
1. Remove each column that contains nothing but nulls—Column 1, Column 5, Column 6, Column 7, Column 8, Column 9, Column 10, Column 11, Column 12, Column 13, and Column 14—by highlighting it, right–clicking, and selecting Remove (see Figure 6-6). Or if you hold down the Ctrl key, you can highlight more than one column at a time and then right-click and select Remove. Keep in mind that you are removing these columns because you do not want any blank columns in the final data set.
Figure 6-6
Survival Tip Make sure you keep your mouse pointer near the column heading when highlighting the columns that contain only nulls. If you don’t, many of your right-click options (e.g., Rename) will not appear. |
2. Remove the last column, Balance, the same way you’ve removed the other columns.
Note Remember that if you accidentally delete something you didn’t want to delete, you can remove the step by clicking on the x next to it in the Applied Steps window (refer to Figure 6-5).
3. Right-click column 2, select Rename (see Figure 6-7), and type the name Category. (Figure 6-7 shows the column already renamed Category.)
Figure 6-7
4. To fill down the Category column, select Transform | Fill | Down (see Figure 6-8).
Figure 6-8
5. Right-click column 3, select Rename, and change the name to Subcategory.
6. To fill down the third column, click Subcategory and select Transform | Fill | Down.
7. To fill down the Type column, click Type and select Transform | Fill | Down.
8. To filter the Type column to exclude nulls, click the Type heading. Click the filter triangle and make sure Null is not ticked (see Figure 6-9).
Figure 6-9
9. Filter the Date column to exclude nulls. Again, do this by clicking the Date heading, clicking the filter triangle, and removing the tick beside Null.
10. Because column 4 now contains just null values, remove it.
11. The data is looking good now, so type the name 06_Accounts_clean_up in the Name row of the Query Settings pane (just above the Applied Steps window).
12. Select Home | Close & Load | Close & Load (see Figure 6-10).
Figure 6-10
Excel brings the data back in, and it now looks as shown in Figure 6-11.
Figure 6-11
Checking Your Results in Excel
At this point, you need to check your numbers. Remember from earlier in the chapter that the total amount was 65,010.65, and the total quantity was 128. Now that you have the cleaned up data in Excel, you need to check that the data is accurate.
1. Highlight the Amount column by clicking in the top cell and pressing Ctrl+Shift+Down Arrow.
2. Check the sum calculation on the bottom-right side of the status bar. Yep, there it is: 65010.65 for the amount.
3. Highlight the Quantity column by clicking in the top cell and pressing Ctrl+Shift+Down Arrow.
4. Check the sum calculation (not the count) on the bottom-right side of the status bar. It’s 128—just the number you were hoping for.
5. Now you have completely tidied up the data, and it’s ready to be pivoted, so save the file as My_06_query_accounts_clean_up.
Okay, this particular data set looks great. But what about next month, when you have to clean up a new data set?
When my son was young and I’d toss him up in the air (just a little bit…), he’d beg me “Mom, do it again,” and that’s what we are going to do here (the repetition, not the tossing up in the air). Now you need to clean up a different file, but you need to clean up all the same stuff in it that you just cleaned up in 06_Accounts_cata_dump_01. Pretend that the file you’re cleaning up here is the next month’s file.
1. Open the file 06_Accounts_data_dump_02. (Note that the data here has already been converted to a table.)
2. Check the quantity in cell Q109: It’s 146. Also check the amount in cell U109: It’s 57,751.82. Make a note of these figures because you’ll be using them as a cross-check later on.
3. Close the file 06_Accounts_data_dump_02.
4. Open the file My_06_query_accounts_clean_up. (You may have to click a button that says Enable Content
Note This file contains the Power Query code (called M) that was created in the background when you cleaned up the file 06_Accounts_data_dump_01. You are now going to use this code to clean up the 06_Accounts_data_dump_02 file.
5. Select Power Query | Launch Editor (see Figure 6-12).
Figure 6-12
6. Find Source in the Applied Steps window and click the cog to the right of it (see Figure 6-13).
Figure 6-13
7. In the dialog box that appears, click the Browse button (see Figure 6-14), navigate to 06_Accounts_data_dump_02, and click OK.
Figure 6-14
Note The path that appears for your file will not be what you see here because you will have the files stored in a different location.
8. Select Home | Close & Load | Close & Load (refer to Figure 6-10). You are now back in Excel again.
9. To check your sum, highlight the Amount column by clicking in the top cell and pressing Ctrl+Shift+Down Arrow. You see in the status bar that the sum is 57751.82.
10. To check your quantity, highlight the Quantity column by clicking in the top cell and pressing Ctrl+Shift+Down Arrow. You see in the status bar that the sum is 146.
Holy moly! That was FAST! That’s right: You’ve cleaned up the next month’s file, and it took less than a minute! All you have to do now is save this file under an appropriate name, and you are ready to pivot the data. (Honestly, every time I do this, I’m impressed myself.)
Converting Unpivoted Data to a Pivoted Format
Excel works best with data that is presented vertically, as in Table 6-1. This is also called normalized data, as you’ve heard a time or two in this book.
Table 6-1: Unpivoted Data in Correct Normalized Format
Date |
Invoice Number |
Product |
Amount |
1/11/2015 |
00123 |
Apples |
100 |
1/11/2015 |
00123 |
Oranges |
500 |
1/11/2015 |
00123 |
Pineapples |
150 |
But Table 6-2 shows how most Excel users assemble their data.
Table 6-2: Already Pivoted Data
1/11/2015 |
1/11/2015 |
1/11/2015 |
|
Apples |
100 |
||
Oranges |
500 |
||
Pineapples |
150 |
The problem is that as more invoices and products are added, Table 6-2 becomes more and more unwieldy, and extracting data out of it becomes harder and harder. On the other hand, even though it seems counterintuitive, Table 6-1 is tailor made for use with a pivot table. And it also allows you to more easily add new invoices and products. The idea is for data to be vertical rather than horizontal.
There are quite a few ways in Excel to “unpivot” data, but usually they involve a few formulaic gymnastics. You’ll be glad to know that the steps for unpivoting data in Power Query are relatively straightforward.
If you open the file 06_Data_for_unpivot, you see that it is classic pivoted data (but has already been converted into a table). However, if you want to do a pivot table on it in order to do more analysis, you couldn’t do it as is. Here you are going to use Power Query to unpivot this data. You are also going to pull this data into Power Query in a slightly different way than you did earlier in the chapter.
1. With the file 06_Data_for_unpivot open, select Power Query | From Table and note that Power Query automatically selects and pulls in all the data in it (see Figure 6-15).
Figure 6-15
2. Highlight column 1 (Row Labels) and select Transform | Unpivot Columns | Unpivot Other Columns (see Figure 6-16).
Figure 6-16
Note In step 2 you are unpivoting the columns, but you could also do it by highlighting all columns except column 1 and then selecting Transform | Unpivot Columns | Unpivot Columns.
3. Right-click column 1, select Rename, and change the name to Account Name.
4. Right-click the Attribute column, select Rename, and change the name to Month.
5. To convert the month entries to date format, highlight the column Month. Then check the Home tab, and you can see that the dropdown says Text. Now click on the triangle beside this and choose Date. Note that the months now become dates. You need the data in this format if you want to use months as a way to summarize the data.
6. For Value, to make sure the numbers are numbers, highlight the column and check the dropdown on the Home tab to ensure that it says Decimal Number, as shown in Figure 6-17. You don’t need to rename it.
Figure 6-17
7. Type the name Accounts in the Name row of the Query Settings pane (just above the Applied Steps window), as shown in Figure 6-18.
Figure 6-18
8. Select Home | Close & Load | Close & Load.
9. Excel brings back the data, and it’s ready to be converted to a pivot table.
Note In this example, you have pulled the data straight into Power Query from a table. Because you have done it this way, there is no “cog” available (like the one shown in Figure 6-13). This shows that you can just do a Power Query “on the fly.” If you did want to re-use the query, you would have to ensure that the data is in a separate file and then pull it in as discussed earlier in this chapter. You can practise creating an on-the-fly query by using the file 06_Data_unpivot_02 (which has a similar format to 06_Data_for_unpivot but contains just 6 months of data.)
Creating a Query in Power Query to Merge Data Sets
You’ve already learned how to use Power Query to clean up an accounting data dump and also how to use Power Query to convert unpivoted data to a pivoted format. Now you’re going to see how to use Power Query to merge data sets. For this scenario, you will be working with three separate files for three separate months.
In the folder 06_Merge_folder there are three files that you need to combine:
Note that these files are all in CSV format. If they were Excel files, you would need to take a different approach than is described here.
Note These files use the European date structure dd/mm/yyyy.
You need to combine these files into a single file, but before you do, you need to do a quick check of the starting totals:
So for all three files, you get the following: total count = 1917, summed total = 11996732.
1. Close all three of the files you want to combine and select Power Query | From File | From Folder (see Figure 6-19).
Figure 6-19
2. In the Folder dialog box that appears, click Browse (see Figure 6-20) and navigate to the folder where you have stored the file and click OK. You are now in Power Query, and one of the columns has the title Extension.
Figure 6-20
3. If you have any other type of files in the folder, filter in the Extension column so you see just the CSV files (see Figure 6-21).
Figure 6-21
4. Click the double arrow beside the Content column (see Figure 6-22).This automatically combines the files together and you will see the content of the three of them listed.
Figure 6-22
Note If any of your chosen files are not in proper CSV format, the query will not work at all. The file should have been saved as CSV (comma delimited), with the extension.csv. If this is not the case, you need to save it in this format.
5. Because in each file the Line Item entry has no data attached to it, remove the entry from each file (by highlighting the Line Item column and filtering to remove it, making sure everything except Line Item is ticked in the Line Item column).
6. For the Amount column, highlight the Amount column and on the Home tab, ensure that the dropdown is set to Decimal Number (to ensure that any numbers with decimal places are picked up).
7. Select Home | Close & Load | Close & Load. The file comes back into Excel, where you can check your numbers.
8. To check your sum, highlight the entire Amount column by clicking in the top cell and pressing Ctrl+Shift+Down Arrow. You see in the status bar that the count is 1917. To check your quantity, highlight the Quantity column by clicking in the top cell and pressing Ctrl+Shift+Down Arrow. You see in the status bar that the sum is 11996732. These numbers match the ones you got when you added together the numbers in the original three files.
9. Save this file as Quarter_1_power_query.
Running Your Query to Merge Data Sets
Your data-merging query is great so far, but what happens next month, when you get another month’s worth of data to add? You should be able to guess by now that Power Query makes it a piece of cake.
1. Open the folder 06_Merge_folder. Add to it the file called 06_Merge_files_April from the folder 06_Power_Query. This folder should now contain four files:
Note that the total for the Amount column in the April file is 4185595. When that is added to the total for the preceding three months, you get a total of 16182327.
2. Open the file Quarter_1_power_query. With your mouse in the data, select Data | Refresh All (see Figure 6-23), and Excel automatically brings in the April numbers.
Figure 6-23
3. Highlight the totals in the refreshed file, and you get 16182327—just what you calculated it should be.
Survival Tip When I was coming to grips with Power Query, I found two courses absolutely invaluable. The first one was the course offered by Ken Puls and Miguel Escobar at www.powerquery.training. The second was Mynda Treacy’s www.myonlinetraininghub.com/excel-power-query-course. I highly recommend both of them. |
I’ve covered data clean-up in two chapters of this book because in your Excel journey, you will be spending a lot of time doing clean-up. But Power Query is a game changer. Learning how to use it can be one of the most valuable time investments you can make. You can’t really control how people enter data. Dirty data is a fact of Excel life. But using Power Query is like having a dishwasher instead of doing dishes manually.