Chapter 7 Importing Nondelimited Text Files
One of the biggest challenges for many Excel pros is importing and cleaning nondelimited text files. If you’ve ever dealt with one of these, you know how painful it can be. They typically arrive with some default name like ASCII.TXT and are essentially character-by-character representations of what the output should look like when printed. This means that they’re subject to all kinds of crazy issues, including inconsistent spacing, repeating page headers, and garbage rows mixed in with the real data.
A major part of the job for many Excel pros is importing this type of information into Excel and cleaning it. And all this has to happen before they can get to the analysis that actually adds business value.
If you’ve been there, you know the process follows this general flow:
1. Import the file into Excel via Data → From Text.
2. Work in a postage stamp–sized Import Wizard window, trying to determine how the columns are delimited and which to skip.
3. Dump the results of the Import Wizard into a worksheet.
4. Turn the data into a proper Excel table.
5. Sort and filter the table to remove garbage rows.
6. Clean and trim the text in the columns.
And the best part is that next month, when you get the updated data file, you get to relive this exciting process all over again. Wouldn’t it be nice if there were a better way? Great news! There is, and you’ve found it.
Connecting to the File
You connect to a nondelimited text file the same way you connect to any other text file: create a new query via From File → From Text and browse to the file.
Power Query puts the data in a single column:
Figure 84 The Power Query view of a nondelimited text file.
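Behind the scenes, the connection boils down to a single Source step. Here is a minimal sketch of the M code, assuming a hypothetical file path and Windows-1252 encoding:

    let
        // Read the file and expose each line of text as a row in a single column
        Source = Table.FromColumns(
            {Lines.FromBinary(File.Contents("C:\Data\ASCII.TXT"), null, null, 1252)},
            {"Column1"}
        )
    in
        Source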
The first benefit you’ll notice over the standard Excel import process is that you get to work in a full-sized window. That’s a massive improvement all on its own, as you can actually see what is going on.
Note: If your column is too narrow, just mouse over the right side of the column header, hold down the left mouse button, and drag it wider.
You’ll also notice that, since the file is not delimited with any consistent delimiters, Power Query has not made any guesses about your data. Instead, it has left the entire process to you. Given the state of this file, that is probably not a bad thing.
Note: Before we dig any further into this topic, note that there are many ways to import nondelimited text files, and none of them are right or wrong. The example in this chapter has been architected to show a great deal of transformations via the user interface, as well as the typical way an Excel pro might approach this task. With more experience, you’ll find that quicker routes to the end goal are almost certainly possible.
Cleaning Nondelimited Files
The general goal when starting to clean up a nondelimited file is to get the data into a semblance of columnar data as quickly as possible. In this case, the top 10 rows don’t seem to add much value, while the 11th row looks like it may contain column headers. Therefore, go to Home → Reduce Rows → Remove Rows → Remove Top Rows → 10 → OK.
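Behind the scenes, this command adds a single step to the query. A sketch of the generated M, assuming the previous step is named Source:

    // Skip the first 10 rows of the table
    #"Removed Top Rows" = Table.Skip(Source, 10)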
The rows disappear and will not be imported into the end solution:
Figure 85 The top rows are removed, bringing the headers closer to the top.
Next, you need to choose a direction to break into this data. You could try breaking in from the left or right, but currently there are a ton of extra leading spaces, and there are duplicated spaces in the middle. It would be nice to get rid of those.
In Excel it is standard practice to run textual data through the TRIM() and CLEAN() functions in order to remove all leading, trailing, and duplicate spaces, as well as all nonprinting characters. Power Query has similar functionality, and here’s how you apply it now: right-click Column1 → Transform → Trim, and then right-click Column1 → Transform → Clean.
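These two commands generate steps along the following lines, a sketch assuming the single column is named Column1:

    // Strip leading and trailing spaces, then remove nonprinting characters
    #"Trimmed Text" = Table.TransformColumns(#"Removed Top Rows", {{"Column1", Text.Trim, type text}}),
    #"Cleaned Text" = Table.TransformColumns(#"Trimmed Text", {{"Column1", Text.Clean, type text}})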
The data looks a bit better:
Figure 86 Data trimmed and cleaned.
At this point, you may notice that Power Query’s trim functionality doesn’t work quite the same as Excel’s. While Excel’s TRIM() function removes all leading and trailing spaces and replaces any duplicate spaces in the middle of the data with a single space, Power Query’s doesn’t do that last part. Instead, it only trims off the leading and trailing spaces.
The CLEAN() function in Power Query does line up with Excel’s, although it’s more difficult to see. Nonprinting characters are rendered as a little question mark in a box within the Excel user interface. In Power Query they show as a space. Regardless, if you step back and forth between the Trimmed Text and Cleaned Text steps in the Applied Steps box, you’ll see that the spaces around the & in Avis & Davis have been cleaned away by the Cleaned Text step.
Splitting Columns by Position
The next step is to start splitting apart the columns. The basic approach at this point is to split by a fixed number of characters, making an educated guess about how many you need and then refining that guess. Since the date occupies 10 characters, you can try 12 for a first go: go to Home → Split Column → By Number of Characters, enter 12, choose Repeatedly, and click OK.
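This generates a step built on a repeating-width splitter. A sketch of the M code, where the number of output columns is illustrative:

    // Carve the single column into 12-character slices, repeating across each line
    #"Split Column by Position" = Table.SplitColumn(
        #"Cleaned Text",
        "Column1",
        Splitter.SplitTextByRepeatedLengths(12),
        {"Column1.1", "Column1.2", "Column1.3", "Column1.4", "Column1.5"}
    )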
That plainly didn’t work out! As you can see below, the date column may be fine, but the others sure aren’t:
Figure 87 The data didn’t split as well as you’d intended.
This is not a big deal, as you can just refine your original efforts: delete the Split Column by Position step from the Applied Steps box and split again with better widths, or adjust the split positions directly in the formula bar:
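One way to express a refined split in M is with explicit character offsets rather than one repeated width. A sketch, where the positions below are purely illustrative and would be tuned to your file:

    // Split at fixed character offsets instead of one repeated width
    #"Split Column by Position" = Table.SplitColumn(
        #"Cleaned Text",
        "Column1",
        Splitter.SplitTextByPositions({0, 12, 24, 40, 50, 60}),
        6
    )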
The result is much better:
Figure 88 A much more inspiring view of the data.
Note: It is also worth mentioning that there is nothing forcing you to choose the Repeatedly setting in the options when splitting columns. If the document is inconsistent, you can choose to split once from the left or right side. This gives you very granular control on a column-by-column basis.
You can now make two more changes. Since the Changed Type step just declares all the columns as text (which they won’t be when you’re done), you can remove it as irrelevant: click the × beside Changed Type in the Applied Steps box. Then promote the first row to column headers by going to Transform → Use First Row as Headers.
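The header promotion step looks roughly like this in M:

    // Promote the first data row to become the column headers
    #"Promoted Headers" = Table.PromoteHeaders(#"Split Column by Position", [PromoteAllScalars = true])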
The Beauty of Errors in Power Query
The data is now starting to look somewhat cleaned, even if you’d like to change some of the column headers as you go along. At this point, it’s typically recommended that you work from left to right, cleaning up as much of the columnar data as you can and making sure it’s all valid.
If you scroll down at this point, you’ll find that there are a lot of garbage rows in this data, mostly from the repeating page headers and section breaks that were included in the document. The first block of these issues occurs at row 40 and introduces a bunch of ugliness:
Figure 89 Irrelevant rows mixed in with real data.
The question is how to deal with these. Some are dates, some are text, some are nulls. Try forcing the column to a date type: select the Tran Date column and go to Transform → Data Type → Date.
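In M, the type change is a one-liner, assuming the column header came through as Tran Date:

    // Force the column to dates; rows that can't convert become errors
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers", {{"Tran Date", type date}})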
The result is a bunch of errors in the Tran Date column:
Figure 90 Errors that result from trying to convert to dates.
Unlike in any other program, errors are truly exciting in Power Query. They are exciting because you can control them and react to them. If you look carefully at this data, you’ll see that errors were caused only in rows that you want to filter out anyway. In addition, every row that has a null in the Tran Date column holds values in the subsequent columns that are also not part of the transactional data you want to keep. So you can get rid of both: right-click the Tran Date column → Remove Errors, and then use the column’s filter arrow to uncheck (null).
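These two actions translate into M along these lines:

    // Drop rows where Tran Date errored, then drop rows where it is null
    #"Removed Errors" = Table.RemoveRowsWithErrors(#"Changed Type", {"Tran Date"}),
    #"Filtered Rows" = Table.SelectRows(#"Removed Errors", each [Tran Date] <> null)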
The results are quite encouraging, and you now have a TranDate column with valid dates from top to bottom:
Figure 91 The TranDate column showing valid dates from top to bottom.
Now, you should recognize that there are a few rows that still appear but don’t belong in this data set. The challenge is that you don’t really want to filter out those dates as some of them might be valid one day (maybe on 3/1/0123). So you can move on to the next column and see if you can fix these issues there:
Setting the next column to a Decimal Number type causes Power Query to attempt to convert every entry, which triggers some errors. Remove those errors and the nulls from the data set, just as you did for the Tran Date column:
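A sketch of the same pattern in M, with Amount standing in for whatever the column is actually named in your preview:

    // Convert to decimal numbers, then clear out the error and null rows
    #"Changed Type1" = Table.TransformColumnTypes(#"Filtered Rows", {{"Amount", type number}}),
    #"Removed Errors1" = Table.RemoveRowsWithErrors(#"Changed Type1", {"Amount"}),
    #"Filtered Rows1" = Table.SelectRows(#"Removed Errors1", each [Amount] <> null)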
If you now check the data set around row 40 (and further), you’ll see that all the garbage rows are completely gone.
Removing extra columns is very simple; you just need to follow one simple process when doing so: check the contents of each column to confirm it holds nothing of value before you remove it.
Checking each of the columns in the data set, you can see that the third column (with a blank header) appears to hold only blank values. That column can be removed.
Likewise, if you scroll all the way over to Column9, you see that this column holds only null values. That column can also be removed.
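Both removals can be expressed in a single M step. A sketch, with the column names as placeholders for whatever appears in your preview:

    // Remove the blank column and the all-null Column9
    #"Removed Columns" = Table.RemoveColumns(#"Filtered Rows1", {"Column3", "Column9"})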
Aggregating Columns
At this point, it is fairly clear that your initial splitting of the columns was a bit aggressive. It seems that you have four columns that were broken apart incorrectly, as shown below.
Figure 92 Columns split apart in error.
Fortunately, all is not lost here, and you certainly don’t need to go back and start over. You just need to put the columns back together again. Here’s how: select the first of the four columns, hold down Shift, click the last of the four, and then right-click one of the selected column headers and choose Merge Columns.
You’re then given the option of using a separator and providing a name for the new column. In this case you don’t need a separator of any kind, and since you’re going to split this column up differently in a moment anyway, the name really isn’t important. Click OK, and your columns are put back together:
Figure 93 Humpty Dumpty wishes he had Power Query!
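The generated merge step looks something like this, again with placeholder column names:

    // Recombine the four mis-split columns with no separator between them
    #"Merged Columns" = Table.CombineColumns(
        #"Removed Columns",
        {"Column5", "Column6", "Column7", "Column8"},
        Combiner.CombineTextByDelimiter("", QuoteStyle.None),
        "Merged"
    )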
Splitting Columns by Delimiter
The re-aggregated data makes it very clear that the new Merged column is delimited by the - character, which means you have something you can use to break it apart into its components. One thing to take into consideration is that you don’t know whether any vendor uses a hyphen in its company name, so you don’t want to be too aggressive with a split based on the - character. Instead, split only at the first occurrence: go to Home → Split Column → By Delimiter, choose Custom, enter the - character, and select Left-most delimiter, as sketched in the M code after the note below.
Note: You are not limited to delimiters of a single character when splitting by delimiter. In fact, if you want to split by an entire word, you can enter that word as your delimiter.
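A sketch of the generated step; the false argument tells the splitter to scan from the start of the text, so only the left-most hyphen triggers a split:

    // Split the Merged column at the first hyphen only
    #"Split Column by Delimiter" = Table.SplitColumn(
        #"Merged Columns",
        "Merged",
        Splitter.SplitTextByEachDelimiter({"-"}, QuoteStyle.Csv, false),
        {"Merged.1", "Merged.2"}
    )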
The data is then split into two separate columns: Merged.1 and Merged.2. These should be renamed to something more sensible, which you can do by double-clicking each column header and typing a new name:
The result is a data set that is almost perfect:
Figure 94 The data set is now almost perfect.
Trimming Duplicate Spaces
The last thing you need to deal with in this data set is the duplicate spaces that have been left between words in the Vendor column. Since you can’t rely on Power Query’s trim function to collapse internal spaces, you need to take care of this yourself: right-click the Vendor column → Replace Values, enter two spaces as the Value To Find and a single space as the Replace With value, and click OK.
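In M, that replacement is a sketch like the following, assuming a prior rename step called #"Renamed Columns":

    // Replace double spaces with single spaces in the Vendor column
    #"Replaced Value" = Table.ReplaceValue(
        #"Renamed Columns",
        "  ",  // two spaces
        " ",   // one space
        Replacer.ReplaceText,
        {"Vendor"}
    )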
You now have a completely clean data set that can be loaded into a table.
Note: Unfortunately, there is no single function that easily removes internal “whitespace” from a text string. If you suspect that you have some instances of more than two consecutive spaces, you may have to run this replacement a couple of times in order to completely clean the data.
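That said, if you’re comfortable editing the formula bar, a combination of functions can collapse any run of spaces in a single pass. A hedged sketch:

    // Split on spaces, discard the empty fragments, and rejoin with single spaces
    #"Collapsed Spaces" = Table.TransformColumns(
        #"Replaced Value",
        {{"Vendor", each Text.Combine(List.Select(Text.Split(_, " "), each _ <> ""), " "), type text}}
    )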
At this point, you can finalize your query and actually build a report from it. Naturally, you’ll do that by creating a PivotTable. Before you can do that, go to Home → Close & Load to land the data in an Excel table.
Power Query’s Moment to Shine
At this point, you should pause and recognize something important: Your data is clean. Unlike when you load data using Excel’s standard method to import from a text file, no further cleanup is necessary. You were able to load, clean, and transform the data in one user interface dedicated to the process. You’re now sitting in a position where the data can actually be used.
Click anywhere in the table and choose to insert a new PivotTable. Place it in G2 of the current worksheet and configure it as follows:
Your PivotTable should look as follows:
Figure 95 A PivotTable built from this chapter’s text file.
This PivotTable is certainly handy, but face it: Everything accomplished in this chapter so far is entirely possible with just standard Excel. So why do you need Power Query? Is it the full-sized window? That’s cool, but it’s not critical.
You see why Power Query is so critical when you deal with the next part of the equation. Next quarter comes along, and you get a new data file. In an Excel pro’s world, that means another tedious afternoon of importing, cleaning, and reformatting. But armed with Power Query, all that changes. Thanks to Power Query, all you do is edit the query’s Source step to point to the new file and then refresh.
The query’s output updates the table, but you have to force the PivotTable to update as well. Do that by right-clicking the PivotTable and choosing Refresh.
Now you see the benefit of using Power Query:
Figure 96 The same PivotTable, now updated for the next quarter.
There are new vendors, new transactions, and new dates, and they’re all working with no issues. It’s revolutionary, and you’re going to wonder how you ever did your job without this help from Power Query.
Note: If you just save the new file over the old one, you don’t even have to edit the Source step to update the file path. Instead, you simply go to Data → Refresh All (twice: once to update the query’s output table and once more to update the PivotTable) to update the solution.
You should also recognize that this isn’t the end of the journey. In this case, you still had to manually edit the Source step to update the path to the new file. Consider the impact if you’d sourced the data file via the From File → From Folder method described in Chapter 4. If you did, you’d be able to take the list of files in the folder, sort it by the Date Modified property, keep only the newest file, and import just that file’s contents, as in the sketch below.
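A hedged sketch of that pattern in M, with a hypothetical folder path:

    let
        // List every file in the folder, then sort newest first
        Source = Folder.Files("C:\Data\Quarterly"),
        Sorted = Table.Sort(Source, {{"Date modified", Order.Descending}}),
        // Grab the binary content of the most recently modified file
        Newest = Sorted{0}[Content],
        // Import it as lines of text in a single column, ready for the same cleanup steps
        Imported = Table.FromColumns({Lines.FromBinary(Newest, null, null, 1252)}, {"Column1"})
    in
        Imported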
What would you have at this point? You’d have a solution that pulled in the data from the most recently modified file in the subfolder. The need to manually change the file path in the query each quarter would be a thing of the past. As each new quarter’s file got added to the folder, you’d be able to just refresh your solution and bring in the most current data automatically.
Warning: If you decide to use the route listed above, be aware that the Date Modified property is changed each time the file is saved. While Power Query won’t change this property by reading from the file, any modifications made and saved by a user will force the Date Modified property to update. If this is a concern, the Date Accessed and Date Created properties may provide better options for you to use as filters, depending on your end goals.