MDataMonkeyeBookNoIconsForEPub-6

Chapter 4 Importing All Files in a Folder

The classic way that Excel pros deal with importing all files in a folder is incredibly tedious and error prone. Each file needs to be imported, transformed, copied, and pasted into the master table. Depending on how big and complex the transformations are, how many files there are, and how long the solutions have been running, this process can be terrifying.

You’ve already seen that Power Query can eliminate the copying/pasting dangers involved with importing and appending files on a one-by-one basis, but what about these pretty serious issues:

Importing multiple files is tedious.
Repeating complex transformation steps is error prone.

The good news is that Power Query has a way to deal with both of these issues as well.

Note: This chapter focuses on importing binary files, such as TXT and CSV files. Chapter 5 covers importing data from multiple workbooks.

Sourcing All Files in a Folder

The example files for this chapter are broken down into four subfolders:

Begin
2008 – More
2009
2010

Each folder contains a variety of text files that you will combine.

Extracting the Files Listing

To get started, open a blank workbook:

Start a new query → From File → From Folder
Click Browse

You are presented with the dialog that Power Query uses to select folders.

Locate the Ch04 Examples\Begin folder → select it → OK (to close the folder picker dialog)
Click OK (to load the folder to Power Query)

You now see a new Power Query window. This time, things look quite different than what you’ve seen so far. Instead of transactions, you see a list of files and their properties:

Figure 46 The list of files in the Begin folder.

Future-Proofing File Selection

In this chapter, you will be combining all the CSV files from the four folders. Before you go any further, you should filter this list to ensure that only CSV files are present. Even though you see only CSV files in the screen above, you just never know when someone else will decide to store an Excel workbook in this folder. If that happens, it’s sure to break your solution right when you’re at your busiest. For the amount of effort it takes, a little future-proofing is well worth the time.

To filter to just CSV files, you’d normally click the filter arrow on the Extension column, uncheck the (Select All) option, and then check CSV. Unfortunately, because you only have CSV files in the list right now, doing that rechecks the (Select All) option, which is not going to help. You’re going to need to go a bit deeper to protect yourself here and force your filter to CSV.

Before you do that, however, you also need to guard against accidentally filtering out “CSV” files when you filter for “csv” (as text filters are case sensitive). To deal with this, follow these steps:

Right-click the Extension column → Transform → Lowercase
Select the Extension column’s filter arrow → Text Filters → Equals
Set the filter to Equals .csv

Warning: Don’t forget the period in front of the csv! If you forget it, you won’t get any files, as csv is not the same as .csv.

Click OK

Even though things don’t look any different after you take these steps, you’ve now forced the system to only accept CSV files for your operations, reducing the chance of contaminated data blowing apart your solution.

Combining Files

Now the time has come to combine files. Provided that you are working with binary files such as TXT or CSV files, this is incredibly easy once you know the trick.

Combining the Initial File Set

Take a careful look at the icons in the first three columns of the current query:

Figure 47 The first three columns of the current query.

The Name column shows an unfiltered filter icon, and the Extension column shows a filter that is currently being used. But what about the Content column? What is that icon?

It’s incredible, that’s what it is. That tiny little button appears on columns containing binary files, and when you click it, magic happens:

Figure 48 Magic in action (combining binaries).

A lot of stuff happens when you clicked that magic button, and looking at the Applied Steps box lets you retrace all of it.

In this example, we were on the Filtered Rows step when we clicked. From there Power Query combined all the individual files into one massive CSV, imported it, scanned it, promoted the first row to headers, and made a guess about the data types. All this happened with the click of one tiny button!

That seems too good to be true, and naturally the preview window is saying that you’ve got 999+ rows, so you can’t even verify that it worked. You need to finalize this query and pivot it so that you can reassure yourself that this is working. Follow these steps:

Remove the Changed Type step
Right-click TranDate → Change Type → Using Locale → Date → English (United States) → OK
Right-click Account → Change Type → Whole Number
Right-click Dept → Change Type → Whole Number
Right-click Amount → Change Type → Decimal Number
Rename the query from Query1 (or Begin in 2013) to Transactions
Select TranDate → Home → Reduce Rows → Remove Errors
Right-click TranDate → Rename → Date
Right-click Sum of Amount → Rename → Amount
Home → Close & Load

The data lands 6,084 rows in an Excel worksheet, which looks promising. Now you can pivot it:

Select a cell in the table → Insert → PivotTable

Create the PivotTable on the same worksheet in cell F2

Drag Amount to Values

Drag Date to Rows

Right-click F3 → Group → Select Months & Years → OK

You can see below that you are indeed looking at the data from all three files—and you didn’t have to pre-import each file and then append them manually!

Figure 49 Data representing all files imported from a single folder.

Adding More Files

One of the things that makes Power Query so great is the ability to refresh a solution at will and see it work. So far in this book the examples have been targeted at specific files and, without changing the details of the files, it’s hard to see that a refresh has any effect. That changes now.

Go back to the Ch04 Example Files folder. Remember that there were four folders in there, and you targeted a solution against the Begin folder. Now you’re going to simulate the process of adding new files to the solution:

Drag the 2008 – More folder into the Begin folder
Return to Excel → Data → Refresh All

Power Query kicks off the refresh process and updates the Transactions table, showing that it has loaded 25,700 rows. However, once again, the PivotTable doesn’t update:

Figure 50 Plainly you’ve got more files, but why aren’t they in the PivotTable?

It’s an easy fix. Either right-click the PivotTable → Refresh, or just use Data → Refresh All again. Here’s what you end up with:

Figure 51 The PivotTable properly updates, proving that you pulled in the records.

How about you add even more files?

Drag the 2009 folder into the Begin folder
Drag the 2010 folder into the Begin folder
Return to Excel → Data → Refresh All → Refresh All

Warning: Using a double Refresh All approach works fine for small data sets. When they get large, however, you might want to use VBA to automate the process of refreshing the queries first and the PivotTables second, as that avoids duplicate calls to the data source. This topic is covered in Chapter 16.

Absolute magic!

Figure 52 60,136 records at the click of a button (or two).

With a solution built in this manner, all you have to do each month is add your data files to the subfolder as you receive them and then refresh the solution.

Recursion

One of the really interesting factors in this solution is that you didn’t need to drop the files directly into the root of the Begin folder in order to have them pulled into the Power Query solution. Power Query practices recursion by default, meaning that it examines the folder and all its subfolders for files. This allows you to keep your data files subclassified by year, if desired, and the solution will still work.

Note: If you go back and edit the query, refresh the preview and then returning to the Source step, you see that the folder path for each file is also listed in the Folder Path column. Using this, you could filter to include only the root folder or exclude certain subfolders should that need arise.

Preserving File Properties on Import

Being able to import all files in one shot is a huge time saver, but consider the following real-world scenario.

A system was set up to export a list of transactions to a file and name it based on the last month and the year (for example, Feb 2008.csv). The issue in this particular case is that the system was not set up to include a transaction date inside the file, as all transactions were assumed to be the month-end date. The user who imported these files was expected to import all transactions and place the appropriate month-end date on each row, and he had two years’ worth of files to process. Unfortunately, when combining files using the method outlined earlier in this chapter for combining binaries, you lose access to the filenames, which were a key component in solving this solution.

Does this mean you can’t use Power Query to do the job? Not at all. You just need to use some different tricks, described next.

Setting Up the Import

The initial steps in setting up the import are the same as outlined earlier in the chapter, with the only difference being that you’ll now get a large list of files (since you moved the extra folders into the Begin folder). Start by following these steps, which are the same ones you took earlier:

New Query → From File → From Folder
Select the Begin Folder
Right-click the Extension column → Transform → Lowercase
Select the Extension column’s filter arrow → Text Filters → Equals → Equals .csv

From here, things change from what you’ve done before

Removing Extraneous Columns

There are a lot of columns in the Power Query editor right now, and you really don’t need them for the next steps. You should get rid of the irrelevant ones so you can focus on the ones that truly matter:

Select the Name column → hold down Ctrl → select the Content column
Right-click one of the columns → Remove Other Columns

Note: Power Query works as a sequential macro recorder, processing each step before moving on. This means that unlike Excel, it doesn’t require precedent columns to stick around. Given that, and because you’ve already filtered CSV files, you can remove that column as well.

These steps had two effects. The first is that you removed all the extraneous columns in the easiest fashion possible (which is more future-proofing in case the Power Query team decides to add another column to this view in a future update). Second, because of the order in which you selected the columns, you flipped the order in which they are presented in the Power Query editor, as shown below:

Figure 53 Columns thinned down and reordered in the Power Query editor.

Converting Text to Dates

The next task is to convert the filenames into valid dates for the month end. Power Query actually has some very good facilities for this, but in order for them to work, the data needs to look like a date, which takes a couple of tricks:

Right-click the Name column → Replace Values
Set Value to Find to .csv → leave Replace With blank → OK
Right-click the Name column → Replace Values
Set Value to Find to a single space → set Replace With to “ 1, ” (space number one comma space—but without the quotation marks) → OK

The data should now look as follows:

Figure 54 Data that looks like a date but is still text.

At this point you can force the data to be a valid date and increment it to the last day of the month:

Right-click the Name column → Change Type → Date

Go to the Transform tab → Date → Month → End of Month

Right-click the Name column → Rename → Date

You now have the month-end dates prepared to merge with your transactional records:

Figure 55 Proper dates in the Date column.

Note: There is no need to force this set of dates to import using the locale settings, as Power Query will correctly recognize a full text date.

Merging File Properties with Binary Content

At this point you are ready to merge the month-end dates with all the rows inside the binary files. The big challenge is that clicking the Combine Binaries button just throws away the Date column you worked so hard on. So you need to extract the contents of each binary file. Here’s what you do:

Go to the Add Column tab → Add Custom Column

Enter the following formula:
=Csv.Document([Content])

Click OK

Warning: Power Query formulas are case sensitive, so make sure that everything is cased correctly, or it won’t work.

Note: If you don’t want to type a long field name, you can just double-click it in the field list when you are building your custom column formula. This will place the field list in your formula, surrounded by square braces.

The result is a new column called Custom, which is filled with a bunch of tables:

Figure 56 A column of tables.

Stop and think about that formula for a second. The Csv.Document() function is the function you can use to turn the contents of a binary file into a table. [Content] simply refers to the column name.

Note: Although you might expect one, there is no Txt.Document() function. If you want to convert the contents of a text file into a table, you use the Csv.Document() function.

To see what is inside one of those files, click in the whitespace beside the word Table. (If you actually click the word Table, Power Query will drill into the table, but if you click to the right side, you see a preview at the bottom of the screen, as shown below.)

Figure 57 A preview of the Feb 2009 binary files contents in table format.

With that content now available to you, how do you get to it? First, clear out the noise by right-clicking the Content column → Remove.

It’s finally time. There is a new little double-headed arrow symbol at the top right of the Custom column. Click it, and you get this:

Figure 58 Expanding tables into columns.

This dialog allows you to choose which columns to expand. If there were columns you didn’t want, you would simply uncheck them. You’ve already seen via the preview that each of these four columns contains useful information, so in this case, you can keep them all.

The only change that you’ll want to make here is to uncheck the Use Original Column Name as Prefix checkbox. If you leave this set, Power Query will prefix each column with Custom, so you get Custom.Column1, Custom.Column2, and so on.

When you click OK, your columns all expand, but this time Power Query preserves the Date column (unlike the Combine Binaries feature, which re moved it):

Figure 59 The dates stick around when the column is expanded.

It’s unfortunate that Power Query doesn’t try to take any liberties and help you out with data types and such here. You have to do that manual cleanup yourself, but at least this is a one-time thing, as it will just refresh in future. Here’s what you need to do:

Go to the Transform tab → Use First Row as Headers
Right-click Column 1 → Rename → Month End
Right-click TranDate → Change Type → Using Locale → Date → English (US)
Select TranDate → Home → Reduce Rows → Remove Errors
Right-click TranDate → Rename → Date
Right-click Sum of Amount → Rename → Amount
Right-click Account → Change Type → Whole Number
Right-click Dept → Change Type → Whole Number
Right-click Amount → Change Type → Decimal Number

The query is now shaped up to be consistent with the previous output:

Figure 60 All files imported from a folder, preserving data based on the file properties.

Finally, you can complete the query:

Change the query name to ManualCombine
Home → Close & Load
And at this point you can build a PivotTable and prove that it matches the results you got in the previous section, with the only difference being that you’ve included a new field that was derived from the properties—something that is impossible using the easy Combine Binaries method.

Implications of the From Folder Feature

Importing files on an individual basis as described in Chapter 2 is generally the way people start building Power Query solutions. That approach can work well in any of the following scenarios:

Where the data set will grow slowly over time

Where the transformation requirements are different for each file in the folder

Where y ou didn’t realize the solution would expand over multiple files

If, however, you suspect that a solution may grow to be bigger than one file, and the data setup will be consistent, there is nothing to stop you from using the approach in this chapter as follows:

Set up a subfolder
Move your single file into that folder
Leverage the Import From Folder functionality against the subfolder

Yes, there is only one file in there today, but so what? As you get new files that pertain to the solution, you just throw them in the subfolder, and they are instantly part of your solution!

Sometimes a little foresight can go a long way in building solutions that stand the test of time.