Chapter 1 Importing Basic Data
If you’re an Excel pro, it’s highly likely that your life is all about importing, manipulating, and transforming data so that you can use it. Sadly, many of us don’t have access to big databases with curated data. Instead, we are fed a steady diet of TXT or CSV files and have to go through the process of importing them into our Excel solutions before we can start our analysis. For us, critical business information is stored in the following formats:
Fortunately, Power Query was built for us, and it allows us to import our data from any of these sources.
Importing Delimited Files
The process of importing a delimited file such as a .CSV or tab-delimited .TXT file is fairly straightforward, and follows the basic ETL (extract, transform, and load) process, as described in the following sections.
Extract (from the File)
The download package for this chapter contains two delimited files, either of which will work for this example. Both are named Ch01-Delimited, though one is a comma-delimited CSV file and the other is a tab-delimited TXT file. To import either delimited file using Power Query you can:
Note: In different versions of Excel, you use slightly different methods for creating a new query. To learn how to create a new query in your version of Excel, see the section “Creating New Queries” in the Introduction.
Excel launches a new window, the Power Query editor, which looks like this:
Figure 11 The Power Query editor with an imported delimited file.
Transform (into the Desired Output)
This Power Query editor view above shows some important pieces of information that are worth taking notice of:
It’s the last point that is the most salient here. If you try clicking the Source step, you see a slightly different view of your data:
Figure 12 The data as Power Query originally imported it, shown by clicking the Source step.
The impact of this is fairly important. Power Query imported your data, analyzed it, and noticed some trends. What you don’t see is that Power Query determined that your data is columnar (based on the fact that CSV files have their columns separated by commas), so it (correctly) split it into columns. That data landed in the Source step.
Power Query then analyzed your data further and identified that the first row had text headers that were inconsistent with the values in the columns. It therefore added the step Promoted Headers to promote the first row to be the column headers for your table.
Next, Power Query attempted to identify and set the data types in the columns. It made the assumption that the TranDate column is dates, the Account and Dept columns are numbers, and the Sum of Amount column is values. It then applied those data types in the Changed Type step you see in the Applied Steps box.
The great thing here is that you can step backward and forward through these steps to see what Power Query did to the data.
You can make some modifications to clean up the data . . . like fixing the terms in the column headers to be more readable. To do this, follow these steps:
The Power Query editor now looks like this:
Figure 13 The data set with renamed columns.
Take another look at the Applied Steps box right now. If you were watching closely as you did this, you’d have noticed that a new step was created when you renamed the Date column. But another step was not created when you renamed the Amount column. Instead, the two steps were merged together. Power Query tries to keep things efficient by merging like operations as you are building a query.
The query is looking pretty nice, but what if you don’t agree with Power Query’s data type choices? For example, the Account and Dept column values should be formatted as text, not numbers. Fixing this is no problem:
The Account and Dept fields are now aligned to the left of the column, which indicates that they are formatted as text, not values, consistent with Excel. You can also see a new step in the Applied Steps box, called Changed Type1:
Figure 14 The data set, looking all pretty and ready for loading.
So why didn’t the type changes merge back into the original Changed Type step? The answer is that Power Query doesn’t know if you did something important in the preceding step, so it reacts the safe way and creates a new step.
Power Query essentially works as a sequential macro recorder: It will execute exactly these steps, in exactly this order, each time you refresh the data.
Load
It’s now time to finalize the query, with the load step. Before you commit your query, you should give it a more descriptive name than the default. Excel will use the name you provide here as the name of your output table or query. Follow these steps:
By default, the data is loaded into an Excel table bearing the name of the query. You can actually see the table get created in a gray color scheme and then turn green. Each table query goes through this process whenever it is refreshed—first turning gray and then turning green when it is good to go.
You’ll also notice a new task pane, the Workbook Queries pane, pop up on the right side of your Excel window. This useful interface provides you with key information about your queries (such as number of rows loaded and errors encountered), and it also allows you to locate and manage your queries in the future.
Figure 15 The Workbook Queries task pane, showing the results of the query.
Note: If you’re working along with the book and your query shows a significant number of errors, don’t panic. This simply means that your regional settings are not set to a U.S. format. Right now we are focused on how to use the Power Query interface to import data, but rest assured that we will show you how to address this specific issue (among others) in Chapter 2.
With the data loaded in the table, you now have the ability to refresh it any time via a few different methods:
Each time one of these commands is issued, Excel triggers Power Query to open the file, process each of the steps you recorded, and place the data in the table. As long as you save the new transactions file on top of the one you used to build your solution, and as long as the data structure is the same, you’ve just automated your import process down to a couple of clicks!
Note: If you dismiss the Workbook Queries pane, you can show it again via Data tab → Show Queries in Excel 2016 or Power Query tab → Show Pane in Excel 2010/2013.
Warning: In order to preserve the table and other features of Power Query, this file must be saved in a valid Excel file format, such as XLSX, XLSB, or XLSM.
You can import three general formats of Excel data from a workbook. You can import data held in:
Note: Chapter 5 covers importing data from worksheets without tables and data contained in other workbooks.
Connecting to Excel Tables
Rather than connect to an official Excel table, in this case you will connect to data that is in a tabular format but with no table style yet applied. The data to use for this is located in the Ch01 Examples\Excel Data.xlsx file on the Table worksheet, and it looks like this:
Figure 16 Raw data in an Excel worksheet.
To pull this data into Power Query, follow these steps:
At this point, Excel kicks off the process of creating an official Excel table for you, prompting you to confirm the table boundaries and whether the data set includes headers. Once you confirm the details, you are launched into the Power Query interface.
Note: If you started with an official Excel table, you would just be launched directly into the Power Query editor, without being prompted to confirm the range.
As shown below, this interface has some differences from when you pulled in data from a delimited file.
Figure 17 Data loaded from an Excel table.
Note the differences:
Note: When you click Close & Load, Excel attempts to create a new table, using the name of the query shown in the Power Query editor. If the name conflicts with the name of an existing table, Excel appends an underscore and a number to the newly created table name to ensure that there are no duplicated names.
Finalize this query with these steps:
A new worksheet is created, and it contains a duplicate of the original table. The only differences at this point are the table name and that Power Query now has a connection to the data.
Note: There is very little reason to create a duplicate of your table without performing any transformations in the process. We show this process merely to illustrate how to connect and load from an Excel table.
Pulling data from Excel tables is by far the easiest way to pull Excel data into Power Query, but it isn’t the only method.
The challenge with applying a table style in Excel is that it locks column headers in place (breaking dynamic table headers driven by formulas), applies color banding, and makes other stylistic changes to your worksheet that you may not want. This might be a problem if you’ve spent a large amount of time building an analysis, and you don’t want a table style applied to the data range.
The good news is that you can also connect to Excel ranges, not just to tables. To do this, you can use the NamedRange worksheet in the Ch01-Excel Data.xlsx sample file. The data in it is identical to the data in the previous example, but it’s still in raw form, with no table style applied.
Pulling data into Power Query from a named range involves three distinct steps:
Follow these steps with the NamedRange worksheet:
Figure 18 Creating a named range.
Note: After you’ve created this name, you can select it by using the drop-down arrow on the left. No matter where you are in your workbook, you will then jump to this worksheet, and the data in the named range will be selected.
Note: If the named range is selected when you use the From Table command, Power Query does not force a table style on your data and instead refers directly to the data in the named range.
As you can see below, now the Power Query interface looks the way it looks when you import delimited files rather than how it looks when you connect to an Excel table:
Figure 19 Data imported via a named range.
One of the features of Excel tables is that they have a predefined header row. Since that doesn’t exist with a named range, Power Query has to connect to the raw data source and run its analysis to figure out how to treat the data. Much like with delimited text files, it identifies a row that appears to be headers, promotes the headers, and attempts to apply data types to the columns.
Notice that the default name of the query is the name of the named range. Again, it’s a good idea to change this, as Excel will append a number to the table’s name when it is created because a table’s name cannot be identical to any other table or named range. Follow these steps:
Excel tables automatically expand both vertically and horizontally as new data is added. This is one of the great features of Excel tables. But again, the challenge is that they carry a bunch of formatting with them. Named ranges don’t carry all that formatting, but they lack the automatic expansion ability that is so fantastic with Excel tables. As a workaround, you can create a dynamic named range that automatically expands as the data grows.
On the DynamicRange worksheet of the Ch01-Excel Data.xlsx file you’ll find another copy of the original data. Follow these steps to set up a dynamic named range that expands as new records are added:
=Dynamic!$A$5:INDEX(Dynamic!$F:$F,MATCH(99^99,Dynamic!$A:$A))
The challenge now is that you can refer to this named range in Power Query, but because it is a dynamic name, you cannot select it from the Name box in Excel. So if you can’t select it, how can you attach to it with Power Query in the first place? The secret is to create a blank query and then tell Power Query which range you want to connect to. Here’s how:
Note: If you don’t see the formula bar in the Query Editor between the ribbon and the data area, go to the View tab and click the Formula Bar checkbox.
=Excel.CurrentWorkbook()
As shown below, you see a table that lists all the Excel objects in this workbook that you can connect to.
Figure 20 A list of all the objects Power Query sees in the current Excel workbook.
There at the bottom is the DynamicRange object you just created. Click the green word Table in the Content column to the left of DynamicRange. Power Query drills into the range, as shown below.
Figure 21 The contents of the DynamicRange named range.
By looking at the Applied Steps box, you can see that you took these steps:
In addition, Power Query made some assumptions about the data for you, and it took the liberty of applying a few more steps, such as promoting column headers and setting the data types. All you have to do now is follow these steps:
Considerations
Where possible, it is preferable to build your solutions against Excel tables. They are easier to set up than the alternatives, easier to maintain, and quite transparent about where the data is stored.
In specific situations, you might have reasons to avoid tables, however, and you can do that where needed. We recommend that you should generally reach to these techniques only when you have good reasons to do so.