Straight to the Point

The Straight to the Point e-books are designed to thoroughly cover one targeted aspect of Excel.

Excel Data Cleansing
Straight to the Point

Oz du Soleil

Holy Macro! Books

PO Box 541731, Merritt Island FL 32953

Excel Data Cleansing Straight to the Point

© 2019 by Tickling Keys, Inc.

All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information or storage retrieval system without written permission from the publisher.

All terms known in this book known to be trademarks have been appropriately capitalized. Trademarks are the property of their respective owners and are not affiliated with Holy Macro! Books

Every effort has been made to make this book as complete and accurate as possible, but no warranty or fitness is implied. The information is provided on an “as is” basis. The authors and the publisher shall have neither liability nor responsibility to any person or entity with respect to any loss or damages arising from the information contained in this book.

First Published: July 2019
Author: Oz du Soleil

Copyeditor: Kitty Wilson

Cover Design: Suat M. Ozgur

Cover Illustration: Shannon Mattiza, 6'4 Productions

Indexer: Nellie Jay

Tech Editor: Bill Jelen

Screen Reader Captions: Deb Govern

Compositor: Jill Cabot

Published by: Holy Macro! Books, PO Box 541731, Merritt Island, FL 32953

Distributed by Independent Publishers Group, Chicago, IL

ISBN 978-1-61547-150-8

Table of Contents

About the Author

Acknowledgments

INTRODUCTION

A Data Cleansing Example

Data Cleansing as a Skill

The Straight to the Point Ethos

CORRECTING NAMES: PROPER CASE

COMPARING LISTS: WHAT’S OVER HERE THAT’S NOT OVER THERE?

Invitations and Responses (Match It All Up!)

Determining What’s over There That’s Not over Here

A Word About Strategy

PEELING, PARSING, AND SEGMENTING

Extracting the First Name (Using Flash Fill)

Splitting by a Single Delimiter: Separating the City from the Name

Splitting into Rows: Getting Those People Out of There!

IDENTIFYING DUPLICATE RECORDS: FUZZY MATCHING

Excel’s Duplicate Remover: The Hazard!

Reality, Context, and Strategy: Flagging Records for Review Instead of Clearing Duplicates

MERGING AND APPENDING MULTIPLE WORKBOOKS

FROM USELESS TO USEFUL: FLATTENING A REPORT

Let’s Flatten Some Stuff!

FINAL THOUGHTS

Index

This page intentionally left blank. The Introduction starts on the next page.

About the Author

Oz du Soleil is an Excel MVP who’s been working with Excel since 2001. He’s co-author, along with Bill Jelen, of Guerrilla Data Analysis, 2nd edition. Oz has several Excel courses on the LinkedIn Learning platform. He’s best known for the dramatic and colorful Excel tutorials on his YouTube channel Excel on Fire.

The author, Oz, with arms folded, wearing an MVP jacket and his trademark hat.

Data cleansing is Oz’s area of specialization in Excel. From his earliest days with Excel, he has found himself constantly needing to fix names that are ALLCAPS, peel addresses away from phone numbers, clean up the messes that result from extracting data from PDF files, and fix all the many other things that prevent data from being useful.

Oz has presented Excel topics and master classes at conferences in Amsterdam; Sofia, Bulgaria; São Paulo, Brazil; Toronto, Canada; and cities around the United States.

When Oz isn’t elbows-deep in the guts of a spreadsheet, he does storytelling around Portland, Oregon. He has told stories onstage for Risk!, Pants on Fire, Seven Deadly Sins, Pickathon, The Moth, and other storytelling shows.

Acknowledgments

Thanks to MrExcel, Bill Jelen, for the opportunity to share one of my favorite Excel topics.

INTRODUCTION

A newspaper reporter once described to me the problem she had in getting the public school system to turn over data that it was legally required to turn over. In the face of school closings and outraged parents, the mayor and school administrators kept repeating things like “the data say…” and “the data tell us…”—and the reporter wanted to see this omniscient data.

She eventually did get the data, and it was a mess. The law didn’t require the data to be free of duplicates and ready to be sorted, filtered, and analyzed. And that’s where the investigation ended because the reporter didn’t have the skill or resources to cleanse the data.

Lawyers have complained about similar problems. The opposing side is required to provide data—but that data doesn’t have to be immediately useful. Someone has to take emails, PDFs, and reports from various banks, PayPal, QuickBooks, etc.; get the formats all the same; and then merge them together before any analysis can happen.

Without clean data, no analysis can happen.

This Straight to the Point guide provides an introduction to data cleansing, which also goes by names such as data munging and data wrangling. Whatever the name, it basically means doing what needs to be done to make data useful and trustworthy. Data cleansing can include the following tasks:

A Data Cleansing Example

A program coordinator is all set to analyze some survey results and—oh, no!—the information looks like this:

The data is poorly formatted. Instructor name appears once in A2. Days (Tuesday and Wednesday) appear in B3 and B12. Headings of Survey 1 through 7 appears in C3:C9 (apparently for Tuesday) and then Survey 8 through 12 appears in C12:C16. The actual answers to the surveys is scaled 1-5. There are four questions that stretch across D:G. Some surveys have no answer for a particular question.

The program coordinator would like to do a lot of things with this data, including the following:

But in its current format, this data can’t be used in any of these ways. It would be possible to write delicate complex formulas to analyze the data in its current shape, but not a lot of people have the skills to do that. In any case, there’s a much more powerful way to make this data usable: Cleanse it by turning it into a “flat file” of plain rows and columns, without extra headers and titles. (You’ll learn more about flattening a report in the section "FROM USELESS TO USEFUL: FLATTENING A REPORT" on page 47) Once you’ve turned this report into a flat file, the world is wide open: You can generate a variety of reports, easily merge this data with other flattened data, share it with someone who needs this source data for purposes different from yours, upload it to a database,…ANYTHING!

Data Cleansing as a Skill

The sad fact is that many people who are taught to analyze data are not taught about data cleansing, and many people—like the reporter mentioned earlier—are stumped when they encounter messy data. A marketing student is likely to be taught analysis on clean data. When he gets out of school, if he’s handed data that’s locked in a PDF and that turns into an atrocious mess when it’s extracted from the PDF, will he know what to do with it? Probably not. If he learns data cleansing as a skill, however, he won’t panic when he encounters such data messes. He will be able to develop a confident strategy for extracting data gold from the mud, rocks, and even permafrost.

The Straight to the Point Ethos

Data cleansing is the topic of this short book, and I’m excited to share it with you because this book doesn’t focus on a specific Excel feature. Instead, it is about strategy and getting things done with whatever is available in all of Excel. This book digs into some of the most common data cleansing needs and focuses on strategy and context. (Sometimes, for example, a down-and-dirty one-time solution is most efficient, and other times a Power Query solution that’s more suited for repeated tasks is the best approach.) In the section "COMPARING LISTS: WHAT’S OVER HERE THAT’S NOT OVER THERE?" on page 7 you’ll see multiple solutions for the same problem, followed by a summary of how to pick the best strategy for claiming victory over uncooperative data in a particular set of circumstances.

CORRECTING NAMES: PROPER CASE

Say that you’ve got a list of names that are messy—a mix of ALLCAPS, all lowercase, and Proper Mixed Case. Maybe you need to print mailing labels and realize that the names are not going to work if you leave them as they are.

A list of names in B2:B9 is a combination of lower case, proper case and all caps.

One way to fix them is to use Excel’s Flash Fill. To start, you have to train Flash Fill on what the desired results should look like. In the example below, you would start by typing Kareem Emment in C2. Then, as shown below, when you type the R for Romona, Flash Fill presents the ghosted (or grayed) text as a preview, as if to ask “Is this what you want?”

After typing the name from C2 in B2 and then starting to type the name in C3, Flash Fill guesses what you want. But here, since the first name was already proper case, they guess incorrectly.

If you were to press Enter at this point, the entire ghosted list of data would be committed as actual text. However, that ghosted result isn’t what you want here because most of the names aren’t correct. This is a good example of how you have to keep training Flash Fill by giving it more examples of the desired result.

As shown in the next image, you can keep training Flash Fill by typing the corrections Romona J. Barker and Mary & Gail Dorsey.

Type more names in C3 and C9 to specifically correct some names that are in lower or mixed case.

This should give Flash Fill enough information to know that you want to capitalize the first letter of each name. So next you can select Data > Data Tools > Flash Fill, as shown below.

Select the range C2:C9, which is the range where you want Flash Fill to operate. Then click the Flash Fill icon, located in the Data Tools group of the Data tab of the Ribbon.

As you can see, the result is beautiful!

The names from column B are now mostly perfectly in proper case in column C. The only possible problem is Macarthur in C5 might need to be MacArthur with an interior capitalized "A" after "Mac".

A yellow Caution sign

Oh, well, the result is almost beautiful. In row 5, Zeke MacArthur’s name should include a capital A. This is one of those situations where Excel can get you 99.999% of the way to where you want to be, but you need to understand both your data and Excel to know where there could be trouble spots.

Here you can see what Flash Fill does with some tricky names.

More errors from Flash Fill. The N after the apostrophe in O'Neil is lower case. And a French name of "de la Croix" where the "d" and "l" should be lower case was incorrectly capitalized.

So, after using Flash Fill, you need to go back, check your results, and manually make any corrections to issues such as these. If you’ve got a list of 500 names and need to manually correct 3 of them, Flash Fill has been a huge help!

NOTE: The results of Flash Fill are final. They are not dynamic, and they involve no formulas. Thus, Flash Fill is good for one-time or infrequent needs. For a dynamic solution, you should instead use formulas or Power Query.

In the olden days, fixing the case of names would have required the PROPER function. You can still use this function today, and it is what you could use if you needed a dynamic fix rather than Flash Fill’s one-time static fix. Be warned, though: PROPER runs into the same problem with turning MacArthur into Macarthur, so you still have to check the results to see if any adjustments are needed.

Instead of using Flash Fill, type a formula of =PROPER(B2) in cell C2 and copy down.

COMPARING LISTS: WHAT’S OVER HERE THAT’S NOT OVER THERE?

It’s difficult to avoid working with lists. People who do all sorts of things with data and who have all kinds of different roles have to deal with comparing, merging, or segmenting datasets. These are some of the reasons you might need to compare lists:

How do you compare lists like these to uncover similarities and differences? Keep reading, and you’ll learn the secrets!

Invitations and Responses (Match It All Up!)

This example shows one list of people who were invited to an event and a separate list of people who have responded. In addition, there is data that you’d like to match up for the people in the two lists. For example, you’d like to be able to easily see that Dacia Cohen was invited and plans to bring two guests, and you want to get her email address alongside her name in case you need to contact her.

The left table had Invitee name, Attending, and Guests. The right table has Invited Name and e-mail.

To do everything you want to do here, the best strategy would be to create a master dataset via a Power Query Full Outer Join. This is the best strategy for two reasons:

The datasets are already in tables, so you need to make queries. With the cursor in the Invitee dataset, select Data > Get & Transform Data > From Table/Range, as shown below.

With a cell in the left table selected, choose Data, Get & Transform Data, From Table/Range.

Excel takes you to the Power Query Editor, where you are going to do only one thing: Change the name of the query to Responded, as shown with the arrow below.

The Power Query editor is a second window within the Excel Application. A data preview is shown. On the right side of the editor, use the Name field in the Properties window and type Responded.

Then, in the upper-left corner, click the arrow underneath Close & Load and select Close & Load To.

The Power Query editor has its own Ribbon. On the Home tab, the left-most icon is Close & Load. Open the Close & Load drop-down menu beneath the Close & Load icon. There are two similar choices: "Close and Load" then "Close and Load To...". Choose the second choice: Close & Load To.

Excel opens the Import Data dialog box, which you can use to tell Excel where to load the result. In this case, you should select Only Create Connection because this is an intermediate step with results that you don’t need on a worksheet; you just need the query so that you can merge it with the next query.

The Import Data dialog asks how you want to view this data in your workbook. Choices are Table, Pivot Table, Pivot Chart, or Only Create Connection. Choose Only Create Connection.

Now, in the Queries & Connections pane, you see the Responded query.

Back in Excel, a new Queries & Connections panel appears on the right side of the screen. One query is listed. The panel shows the query name: Responded and is a Connection Only.

Now you need to follow the same steps again, this time starting in the Invited table. Name this query Invited and load it as a connection only.

When you have the two queries, as shown below, you’re ready to merge them and get everything matched up.

A second "Connection Only" query now appears in the Queries & Connections pane.

To start the merge, select Data > Get & Transform Data > Combine Queries > Merge.

Choose Data, Get Data, Combine Queries, Merge from the Excel Ribbon.

In the Merge dialog box, you see the Invited query at the top and the Responded query at the bottom. Highlight the columns that need to be matched—in this case, Invitee and Invited. For Join Kind, at the bottom of the dialog, select Full Outer. Then click OK.

The Merge dialog in Power Query is subtle. The top shows the Invited table. You have to click on the name column. The bottom shows the Responded table. Also click on the Name column. At the very bottom, in the Join Kind drop-down menu, choose Full Outer Join.

In the Power Query Editor, you can see the Invited query, and the Responded query’s data is collapsed in the Responded column. Click the double arrow at the top of the column and deselect Use Original Column Name as Prefix (because all it does is make a big ugly column header that you don’t need; try it out if you want).

In the Power Query editor, you see all of the columns from the left table but only a single column for the right table. This column seems useless, with each row saying "Table". In the heading for this column, an Expand icon appears. Click the icon for a list of fields in the second table. Leave all fields selected, but uncheck Use Original Column Name As Prefix.

Click OK, and you see all the data, now matched up wonderfully.

The Power Query editor now shows all five columns from both tables in a single table. For the rows where there is no match from the second table, those columns say "null". The very last row in the results has "null" in the Invited columns, but a name of Randy Goodman in the Responded column.

Here are some of the things you can see in the data:

A yellow Caution sign

Wait a minute. What’s the deal with Randy Goodman? She wasn’t invited, but she’s showing up as not only attending but as bringing one guest. Who is Randy Goodman?

NOTE: Randy Goodman shows why you did the Full Outer Join here rather than a simple MATCH or VLOOKUP. If you had only asked “Who haven’t we heard from?” you would not have discovered that you heard from someone who wasn’t even invited. Therefore, when comparing lists like this, it’s critical to not only compare List A against List B but also compare List B against List A.

For the time being, you can simply click Close & Load, and you see the final dataset along with a new query named Merge1, as shown below.

The results of the Merge query in a new Excel worksheet. On the right side of the screen, three query names show up in the Queries & Connections pane, with the new query being the actual merge.

Now send someone to find out who Randy Goodman is!

Determining What’s over There That’s Not over Here

A common data analysis need is to determine what’s in one list that isn’t in another. There are a few ways to find this information: by using a formula, by using dynamic arrays, or by using a Power Query Anti Join. The following sections show how to use all three methods, using the example of a catering company that wants to find out which of its available dishes have never been ordered. Each of these methods has pros and cons, and they’re all especially useful in different scenarios.

Two tables in Excel. On the left, a single-column table with a list of Available Dishes. On the right, a list of Orders and the count of how many times it was ordered.

Available vs. Ordered: Formula Method

As you can see above, both datasets—the available dishes and the ordered dishes—are in tables. To find out which of the available dishes have never been ordered by using a formula, you can start a new column named Match in cell C2. In cell C3 you then use this formula:

=MATCH([@[Available Dishes]],Table1[Orders],0)

Every entry with #N/A means that the available dish was not found in the list of orders. A number in the Match column indicates the position in the orders list where the dish was found; for example, lemon cake was found seventh in the list of orders.

A new column with a MATCH formula next to the left table shows which records are missing from the right table. Any formula that returns the #N/A error is missing.

This is a down-and-dirty one-and-done method: You write one formula, and you get an answer. This is the strategy to use if you just need an answer now and don’t expect to ever need to know such information again.

But what if learning which dishes aren’t ordered becomes a regular need? If you keep using the formula you’ve just created and new orders and new available dishes are regularly added to the tables, the #N/A errors will sooner or later become unbearable visual clutter. In such a case, you might instead choose to use dynamic arrays or Power Query to create a summary of just the items that have never been ordered. The following sections describe these two methods.

Available vs. Ordered: Dynamic Arrays Method

Dynamic array functions and the new Excel calculation engine were released to Office 365 Insiders in September 2018—and they’re spectacular!

To see how dynamic arrays work, you can look again at the previous example but now wrap the MATCH function in IFNA, as shown here:

=IFNA(MATCH([@[Available Dishes]],Table1[Orders],0),”x”)

Wrapping MATCH in the IFNA function converts the ugly errors to an "X".

In column H you create a NEVER ORDERED section. Then, in cell H3, you use the new FILTER function, like this:

=FILTER(Table2,Table2[MATCH]=”x”)

To the right of the two tables, a single FILTER function returns a list of all items in the first table without a match.

The NEVER ORDERED column now lists all the dishes that have never been ordered. It’s so easy!

But wait.…There’s more! Scroll down and unhide rows 30 through 36, and you see that there are more items in both the available and orders lists, as shown below.

For those readers following along with the files for this book, there are six new rows of data hiding in rows 30 to 35. Scroll down, unhide the rows, cut, and paste below the existing tables.

Now add the data from rows 30 through 36 to the data above by dragging it up to the respective tables.

With new records added to the Table, the Dynamic Array automatically adjusts to include the extra rows.

Holy moly! The dynamic array—that is, the NEVER ORDERED list—has updated. You wrote just one formula and didn’t have to refresh anything, as you’d have to do with a PivotTable or Power Query.

Available vs. Ordered: Power Query Anti Join Method

Now it’s time to see how Power Query can help you create a summary of just the items that have never been ordered. This method involves a Power Query option known as an Anti Join. A Left Anti Join can be thought of this way: Give me everything that’s on the left side that does not have a match on the right side.

In the Venn diagram below, say that you’re looking at two sets of people: those who live in Chicago and those who work in Chicago. You can see that there’s some overlap, and a number of people, like Leo and Trudy, fit both criteria (that is, they both live and work in Chicago).

A Power Query Left Anti Join isolates the people who live in Chicago and do not work in Chicago—the boldface list on the left below.

A Venn Diagram has a red circle for Lives in Chicago and a blue circle for Works in Chicago. Ten people live and work in Chicago and are inside the intersection of the cirles. But there are other people who only Live in Chicago or only work in Chicago.

To apply a Left Anti Join to the available vs. orders example, you should create an Available query and an Orders query, as shown below.

Rather than using formulas, create a "Connection Only" query for each of the tables. This image shows the Queries & Connections panel with two queries.

You can merge the Available and Orders queries to create a Never Ordered dataset. To do so, select Data > Get & Transform > Get Data > Combine Queries > Merge.

Choose Data, Get Data, Combine Queries, Merge.

Power Query brings you to the Merge dialog box, where you select the Available query at the top and the Orders query at the bottom.

To let Power Query know what you want to match, highlight the Available Orders and Orders columns, as shown below. For Join Kind, at the bottom of the dialog, select Left Anti. Then click OK.

Once again, the subtle Merge dialog box. For this merge, select the Available Dishes column from the Available table. Select the Orders column from the Orders table. In the Join Kind, choose Left Anti Join (Only in First table).

Inside the Power Query Editor, the left-side data (the Available query) is visible. The right-side data (the Orders query) is collapsed in the Orders column.

Initially, the query results show two columns: Available Dishes and Orders. But the Orders column only says Table, Table, Table, all the way down. Choose the Expand Columns icon at the top of that column to reveal the hidden fields.

Click the button with the diverging arrows and, in the dialog that appears, deselect Use Original Column Name as Prefix.

When you use the Expand Column, both fields are already selected: Orders and Count. However, you should unselect the option for "Use Original Column Name as Prefix".

Woah! You get nothing but nulls in the Orders and Count columns. Why? The Left Anti Join retrieved everything on the left side that does not have a match on the right side—exactly what you were looking for.

The resulting query lists all of the available dishes that were never ordered. The Orders.1 and Count column are null for everything.

Highlight the Available Dishes column, right-click, and select Remove Other Columns.

You don't need those null columns. But rather than deleting two columns, right-click on the column heading that you want to keep and choose Remove Other Columns.

Now click Close & Load.

In the Power Query Ribbon, choose Home, Close & Load.

And here is the result:

After choosing Close & Load, the Results appear in Excel. A third query is in the Queries & Connections pane.

On the sheet with the original data, scroll down and unhide the additional data in rows 30 through 36.

If you are following along with the files, unhide rows 30-36 and add the new records to the original tables.

Now add the data from rows 30 through 36 to the existing data, as shown below.

This shows how the original data, formatted as Table, accepts the newly pasted records as part of the table. The formatting extends and the End of Table marker moved to the new last row.

To apply the query steps to the new data, select Data > Queries & Connections > Refresh All. Bang! You get a new result, as shown below.

After choosing Refresh All, the new data is added to the result table.

A Word About Strategy

You’ve just seen three ways to find the same information using various Excel tools. Which one is best? The answer depends on your context. You have to understand your complete need and determine what you are comfortable with. Let’s review:

PEELING, PARSING, AND SEGMENTING

This section looks at the classic problems involved in separating bits of data that are stuck together—a task sometimes referred to as parsing strings of text.

Extracting the First Name (Using Flash Fill)

Parsing strings of text can be a nightmare. For example, say that you have people’s names and cities all in one column, as shown below.

Sixteen rows of data where column B contains Last Name, First Name, City. The goal in column C is to get the first name in All Caps.

Now suppose you’re having an event and would like to print nametags that have each person’s first name in ALL CAPS. To make that happen, you have to parse the data to get those first names out of the first column.

The details in the first column are separated by commas. Note, though, that rows 6 and 9 have additional commas, which complicates the situation. If you tried to use Excel’s Text-to-Column feature here, it would create a mess for you—and more work—because it would separate the data at every comma and put Jr. and CFP into a column with the first names. This is the first of two reasons you don’t want to use Text-to-Column here.

The second reason you don’t want to use Text-to-Columns in this case is that Flash Fill can help you do the job much more easily. However, with Flash Fill, you have to think ahead. Because of the extra commas in rows 6 and 9, you have to do some extra training to help Flash Fill recognize the pattern.

You can start by typing in PHOEBE in C2. Then, as soon as you type A in C3, Flash Fill’s ghosting lets you know that it will give you the wrong results for rows 6 and 9.

One way to Flash Fill is to type PHOEBE in C2 and then start to type A in C3. But without enough examples, Flash Fill will incorrectly show Junior, Chuck instead of Chuck.

To properly train Flash Fill, you can press Esc and type in CHUCK and HARRIET, as shown below.

Instead, Oz suggests typing PHOEBE in C2, CHUCK in C6 and HARRIET in C9.

Next, highlight the entire column where you want the final names to appear and choose Data > Data Tools > Flash Fill.

With three examples typed, select C2:C17 and press Ctrl+E to invoke Flash Fill

As you can see below, this does the trick, and you get exactly what you wanted!

This time, Flash Fill performs correctly.

Splitting by a Single Delimiter: Separating the City from the Name

Now, using the same data as for the nametag example, say that you need to separate the names from the cities. This would be easy enough to do with Flash Fill, but what if you’ll be adding more data to the list and need to repeat the parsing of the names from the cities? Two words: Power Query!

The data in B has Last Name, First Name, City. The goal is to get Name in E and City in F.

With the cursor in the dataset, you start a Power Query solution by selecting Data > Get & Transform Data > From Table/Range. As shown below, select the checkbox My Table Has Headers. (Yes, the table does have headers.)

Although the original data was not formatted as a table, choose From Table/Range. Excel asks if the Table has headers, just as if you had pressed Ctrl+T.

With the column highlighted, select Home > Split Column > By Delimiter.

In the Power Query editor, choose the only column of data and choose Home, Split Column, But Delimiter. Of interest but not mentioned in the book are the other choices here: By Number of Characters, By Lowercase to Uppercase, By Uppercase to lowercase, By Digit to Non-Digit, By Non-Digit to Digit.

Next, you need to configure the Split Column by Delimiter dialog settings as shown in the image below. One thing that’s vital that you cannot see in the image below is that the custom delimiter is a comma with a space after it. That is, you need to split by a comma and a space, not just a comma. If you do this, you won’t end up with leading and trailing spaces when you’re done with the splitting, and you won’t have to take the extra step of clearing those extra spaces. There’s no reason to work that hard today.

Also note in the image below that you need to split at the right-most delimiter. The Right-Most Delimiter option is beautiful because it allows you to surgically parse strings of text rather than use the ham-fisted Text-to-Columns feature.

This is the Power Query Split Column By Delimiter dialog box. In the Select or Enter Delimiter box, choose Custom. In the next box type a comma and a space. In the Split At section, choose Right Most Delimiter.

Click OK, and you see the result below. Done! Split! You got it!

Still in the Power Query editor, the first column has names and the second column has Cities. The headings are silly: Last Name, First Name, City.1 and Last Name, First Name, City.2.

The last step is to rename the columns Names and Cities. Double-click the column heading and type a new heading.

Rename the columns to be Names and Cities.

Next, select Close & Load > Close & Load To and load the query result to start in cell E3, as shown below.

Instead of Close and Load, choose Close and Load To. In the Insert Data dialog, you can specify that the results should be returned to a specific cell on the current worksheet.

The result, as you can see here, is a beautifully cleansed dataset with names formatted consistently and correctly in one column and cities in another column.

The original column with Name, City is shown starting in B1. The result of the query, now two columns is in E3 on the same worksheet. By default, the result of the query appears on a new worksheet. Using Close and Load To allows you to specify the results appear in the current worksheet.

As hinted at earlier, the benefit of doing this in Power Query is that if you add more data, Power Query applies the same data cleansing to the new data for you. To see how it works, go back to the source data, scroll down, and unhide the data in rows 25 through 30.

Once again, Oz has more data hidden below the original data. Unhide this data and add it to the table above.

Drag that data into the table of names and cities above.

Cut the new records and paste below the original table in B. The table incorporates the new records. But the result table in E3 has not yet updated.

Now all you have to do is refresh, and Excel reruns the query with the new data. Select Data > Queries & Connections > Refresh All, and there’s the newly added data, looking just the way you want it!

After clicking Refresh All, the results table updates to include the extra rows.

Splitting into Rows: Getting Those People Out of There!

Now let’s look at the classic problem of too much stuff in a single cell. This often happens when data is copied from an email or a web page and then pasted into Excel. To see how to deal with this issue, say that you have guests and their table assignments, as shown below.

This is an ugly data set. Column B lists Table 1 through Table 8. Column C contains the guests for each table, but as a list of names separated by semi-colons.

Because of the way the data is formatted, you can’t do any of the following operations:

How can you cleanse this data so that it’s useful? Use Power Query!

Place the cursor in the data range and select Data > Get & Transform Data > From Table/Range.

In this case, you can take advantage of the fact that the names are separated by semicolons. Inside the Power Query Editor, highlight the GUESTS column and then select Home > Split Column > By Delimiter.

In Power Query, select column C with Guests. Choose Split Column, By Delimiter.

As shown in the image below, you can indicate the semicolon as the custom delimiter and select Each Occurrence of the Delimiter to split at each semicolon. Then expand the Advanced Options section and choose Split into Rows. Finally, click OK.

In the Split Column By Delimiter dialog, choose a Custom Delmiter with semi-colon and space. In the Advanced Options section, choose Split Into Rows.

And here you can see the final data—ready for use!

Instead of a single row for Table 1, there are now six rows, each with a different guest.

Click Close & Load, and you see the table shown at left below. You can also create a PivotTable, as shown at right below, to tally the number of people at each table.

With the query results back in Excel, make a pivot table to count guests by Table. It is now obvious that the tables are not balanced: some have 8, but others only have 2 or 3 guests.

IDENTIFYING DUPLICATE RECORDS: FUZZY MATCHING

Duplicate records are insidious. They create problems ranging from mildly embarrassing to catastrophic.

These are some of the many ways duplicates can end up in a dataset:

Duplicates in data can cause simple consequences like the embarrassment of mailing two duplicate pieces of mail to the same person. But duplicate records can also cause big problems, like jeopardizing someone’s professional license—which I’ve actually seen happen. That kind of problem causes a great deal of undue stress!

Excel’s Duplicate Remover: The Hazard!

Let’s look at a simple example of identifying duplicate records. Say that you have a list of 14 records, as shown here, and you’d like to know how many unique items you have. You really don’t have 14 items, though: It’s easy to see that small charcoal shirts is entered three times, medium charcoal shirts is in here twice, and glitter green socks is entered twice.

The four columns are Color, Item, Size, and Notes. Two records say Charcoal, Shirt, Medium and are obviously duplicates. But for Charcoal, Shirt, Small, one of the three records has a note in column F that 50 were ordered. Since this note is on the second duplicate, it will be lost using Excel's Remove Duplicates column.

Excel has a duplicate remover that can help out here. Select Data > Data Tools > Remove Duplicates. Then, in the Remove Duplicates dialog box, select all columns except NOTES.

Select Data, Remove Duplicates. In the Remove Duplicates dialog, unselect the Notes columns

A yellow Caution sign

Look at what happens when you click OK. Excel picks the items to get rid of. Its results are fine if you don’t care about the notes and really wanted just a list of the unique items. But if you did need that note about the charcoal shirts that are due Thursday, you’re in trouble because that note is now gone.

Excel reports that 4 duplicate values were found and removed. 10 Unique Values remain. But the note about 50 shirts being ordered was lost.

The moral of the story: Use Excel’s duplicate remover only when it’s okay for Excel to pick what to delete. The next section shows a safer way to find duplicates. In that section, you will also get a taste of what’s formally called fuzzy matching.

Reality, Context, and Strategy: Flagging Records for Review Instead of Clearing Duplicates

Fuzzy matching is the sometimes grueling task of matching entries that aren’t identical. It can be as simple as using an employee ID to match Kate Montgomery with Kate A. Montgomery. In more difficult cases, you have to deal with entries that have multiple legitimate variations, like these:

At this level, matching duplicates can be an art, or it can be tedious. You have to decide how much effort to put into this and when it’s time to accept that the data won’t ever be 100% clean.

When you’re looking for duplicates, it’s important to think about context and strategy. In the previous example, you let Excel delete duplicates for you. However, when you’re out in the real wilderness, battling messy data, it’s often better to flag potential duplicates and check a couple issues:

To take a peek at how this can play out, consider the data in the following example.

Five columns of data: First name, Last name, Address, Zip Code, and e-mail.

The data here is just foul:

Because of these wrinkles, this is a case where it’s best to flag possible duplicates for review rather than let Excel just delete them for you. You also need a lot more information based on the context. What are the stakes if you get the results wrong? Consider:

One trick for flagging duplicates is to create a unique identifier, such as the ones in column F below, and then use conditional formatting on it. In this case the ID in column F is built from pieces of the source data, using the following formula:

=LEFT([@[First Name]],4)&[@[Zip Code]]&RIGHT([@[Last Name]],5)&MID([@Address],2,4)

A formula in column F creates a key field with bits of each of the other four columns.

The formula takes the following information:

Next, to use conditional formatting to flag possible duplicates, highlight the ID column and select Home > Styles > Conditional Formatting > Highlight Cell Rules > Duplicate Values.

Select the values in the Key Field and choose Duplicate Values from the Conditional Formatting choices.

And here is the result:

Anything highlighted in pink is potentially a duplicate record.

Both Leslies and two of the three Zebs are flagged. You could delete one of the Leslies and one Zeb, but you know you have a third Zeb. How you approach the situation from here is all about strategy and deciding how much effort is needed to get the data to an acceptable level of cleanliness.

If you need a result that’s better than this, you might work on the list in chunks. For example:

1.Separate all records that are clearly, without a doubt, unique. Get them out of the way.

2.Flag unique addresses. Isolate them and identify duplicates vs. multiple people with the same address.

3.Flag unique names. Isolate them to identify duplicates vs. different people with the same name.

4.Use more than one unique identifier to isolate possible duplicates in different ways.

5.As records get verified as unique, move them over to the first list that you know is clean.

6.At the end, there are usually a few very difficult entries that can only be handled manually—perhaps even by finding contact information and asking the people themselves for corrections.

How far you go depends, again, on the stakes.

MERGING AND APPENDING MULTIPLE WORKBOOKS

Now say that you have a folder named FolderImport, and it has five files containing data that you’d like to have stacked up in one place. In this case, you are ordering shirts for all active members and need to know how many of each size to order.

Five Excel Workbook icons represent five workbooks in a folder.

The files look like this:

This shows the first few records of all five files. While all start with Name, some worbkooks are missing columns.

You can quite easily make the following observations about these files:

Before Power Query came along, getting this type of data stacked up was a copy/paste nightmare. It is now much easier to merge these files: Start by opening a new workbook and then select Data > Get & Transform > From File > From Folder.

In Excel, select Data, Get Data, From File, From Folder.

Navigate to the location where you’ve stored the FolderImport folder.

Browse to the folder that contains the files to import.

Next, you see a list of all the files in the folder. Select Combine & Edit.

Power Query shows you a list of five workbooks. In the buttons at the bottom, choose Combine, then Combine and Edit.

In the Combine Files dialog box, highlight Sheet1. In the Example File drop-down, you have to select one of the files that has all of the headers—so in this case, select SW.xlsx.

A yellow Caution sign

If you select NW.xlsx for Example File, Power Query will import only the three columns contained in that file.

Select which workbook should be used as the sample of what columns to expect.

Here is the data! Everything from the five files has been merged, and the columns are all lined up. You’re done!

The results in the Power Query editor show all of the rows from all five workbooks. A new first column identifies from which file each record came from.

NOTE: Notice that Central.xlsx doesn’t have a Shirt Size column, and Power Query handles this by adding a column, adding the column heading Shirt Size, and filling the column with null.

Now select Close & Load, and the result is 98 rows of data from the five source files.

When you close & load back to Excel, several queries appear in the Queries & Connections panel. The last one is the one with all of the records.

If you add a PivotTable and a slicer and filter for only active members, you see that you need 16 more people to let you know their shirt sizes.

A pivot table has shirt size across the top and file name down the left. There are sixteen people with a shirt size of (blank).

FROM USELESS TO USEFUL: FLATTENING A REPORT

A common and often useless starting point for data is a report format. A report is designed to give human beings a predefined view of specific data to answer specific questions. However, what if the report does not answer the questions you’re asking? In this situation, you need a different view of the data, but it might be difficult to get someone to generate a nice pretty report for you. In such cases, it is incredibly helpful to be able to flatten a report into a flat file—that is, turn the report into a simple wall of rows and columns of data. The flattened result isn’t easy for a human to read. However, flattening a report makes it infinitely easier for a human to get Excel to generate any imaginable report that’s based on the data.

Below, the survey results on the left can tell you that Jennifer Fletcher captured 12 surveys over the two days that she presented, and you could use the information shown to calculate some averages. The survey results on the right have been flattened, and although the data isn’t pretty, it’s now in a format that will allow you to look for many other insights:

The ability to flatten data makes you practically invincible.

This compared the original report with the flattened result. The original report is in an Outline format with many blank values in columns A&B that essentially mean "use the previous non-blank value in this column". Also, the questions appear across the top of the report. In the Flattened results, the blanks are filled in and the questions are unpivoted.

Let’s Flatten Some Stuff!

In this section, you’re going to use Power Query to flatten the data from the survey report shown above. But you’re not going to use From Table/Range. Putting the report into a table would be a bit goofy because the report has headers all throughout; it has different sections that don’t go together; and in some situations, reports like these have empty columns and rows that you could easily miss if you’re not paying attention.

Rather than use a table, it’s better to open a brand-new workbook and import the entire worksheet. So, with a new workbook open, select Data > Get & Transform > Get Data > From File > From Workbook.

Get Data, From File, From Workbook.

You now see the Import Data dialog, in which you need to navigate to the location where you have stored the file Survey Results.xlsx.

NOTE: This technique creates a file path between the new Excel workbook and the source workbook. If you anticipate that changes will be made to the source data, you need to store the source workbook in a location that is not likely to change. If the location does change, you need to get into the query and navigate to the new source location.

The Navigator pane gives you a preview of the data, and you can choose to import the entire workbook Survey Results.xlsx or just Sheet1.

The poorly formatted Original Report shows in the preview of the Navigator dialog. Choose to import the entire sheet.

In this case, you want to select the folder to import the entire workbook. Then click Transform Data.

In the Navigator, select Survey Results.xlsx in order to import the entire workbook.

The dataset is in the collapsed Data column, and in this case you don’t need the other columns.

The initial result in Power Query looks unusual. It is one row. The Name is Sheet1. The Data column says Table. The Item and Kind columns both say Sheet1 or Sheet. The Hidden column is False.

Highlight the Data column, right-click, and select Remove Other Columns.

RIght click on the Data column and choose Remove Other Columns.

Click the diverging arrows to expand the column and deselect Use Original Column Name as Prefix.

With only one cell in the result, click the Expand Icon. At the top, you have the option to Expand or Aggregate. Choose Expand. Leave All Columns selected. Unselect Use Original Column Name as Prefix. Click OK.

After you click OK, you see your fabulous data, all ready to be flattened.

You now see the unformatted data from the Excel sheet in the Power Query Editor.

It takes several strategic steps to convert the data from its current format into a wall of columns and rows. First, you don’t need the very top row, so filter Column1 and deselect Instructor.

Use the Filter drop-down at the top of Column 1 to remove the Instructor field.

Next, you need to fill in Column1 and Column2 by using Fill > Down.

Select the mostly blank columns. Right-click and choose Fill, Down.

Next, rename Column4, Column5, Column6, and Column7 to Question1, Question2, Question3, and Question4, respectively.

Rename the columns to be Question 1 through Question 4.

You don’t need the question headers or the rows with Scale: 1 – 5, but you have to be careful about their removal. In Column3, filter out null.

To get rid of additional heading rows, filter out anything that is Null in column 3.

Rename the first three columns Instructor, Day, and Survey.

Rename other columns.

Now you’re about to get into some serious flattening power. You need two columns:

That data is in the columns Question1 through Question4, which you’re going to unpivot. Highlight the columns Instructor, Day, and Survey, right-click, and select Unpivot Other Columns.

Choose the label columns that should not be part of the Unpivot operation. Right-click and Unpivot Other Columns.

There you have it! Doesn’t it feel good?

The four pivoted columns now appear down a column called Attribute. The answer for each question appears in a new column called Value.

Rename the new columns Questions and Scores and click Close & Load. Now, as shown below, this data is flatted—and it’s useful.

Return the data to Excel.

Here’s an example of this data inside a quick-and-easy PivotTable, showing average by day and by question.

The results in a pivot table show average rating for each question by day.

FINAL THOUGHTS

A lot of data cleansing needs are unique. You’ll see a particular situation once and never see it again. Therefore, like a blue heron, you must be patient, relax, focus, view the whole situation, and strike when you’re confident that you have a solution. The alternative is to dive into the confusion and possibly make a situation worse.

A photo of a blue heron

Many times I’ve seen people faced with complex data cleansing tasks, trying to figure out what to do with data that’s 15,000 rows and 40 columns, spread across 3 workbooks and 12 worksheets. That can involve a lot of scrolling, switching around, highlighting columns, trying this and trying that, getting lost, and thinking too much about the data and not enough about the solution.

My advice? Slow down. Make a small model. Maybe work with 30 rows of data instead of 15,000; start with just enough data to get a fair representation of the problem. Break the task into pieces and get each piece working before integrating the pieces into a final solution.

In closing, here are a few data cleansing tips and review points:

Index

Symbols

& to Concatenate 40

A

Appending workbooks 43

Available vs. Ordered

Dynamic arrays method 15

Formula method 14

Power query method 16

C

Close & Load To 9

Comparing Lists 7

Conditional Values

Duplicate records 41

D

Delimiter

Right-most 30

Duplicate records 37

Conditional format 41

Remove Duplicates 38

Dynamic array functions 15

E

Excel on Fire 1

Expand column icon 12

F

Fill down in Power Query 51

FILTER() function 15

First Name, extracting 24

Flagging records 39

Flash Fill 24

for Proper Case 4

ghosting 26

versus Power Query 6

Flattening a Report 47

From Folder 43

Full outer join 8

Fuzzy matching 37

G

Goodman, Randy 13

Guests & Table Assignments 34

I

IFNA() function 15

Import Data dialog 9

Invitations and Responses 7

J

Jelen, Bill 1

Join types

Full outer 8

Left Anti Join 16

L

Lawyers 1

LEFT() function 40

LinkedIn Learning 1

M

MATCH() function 14

Match It All Up 7

Merge query 11

Merging workbooks 43

MID() function 40

O

Only Create Connection 9

P

Parsing 24

Peeling 24

Pivot table 37, 47, 54

Power Query

Comparing lists 7

Fill down 51

From Folder 43

From Workbook 48

Load to specific cell 31

Refresh All 33

Renaming column 31

Split by Delimiter 29

Split to Rows 34

Unpivot Other Columns 52

Proper Case 4

PROPER function 6

Prudhomme Memorial Library 39

Q

Queries & Connections pane 10

Query name, changing 9

R

Refresh All 23

Remove Duplicates 38

Remove other columns 20

Reporter 1

RIGHT() function 40

Right-most delimiter 30

S

Segmenting 24

Slicer 47

Split Column in Power Query 30

Split to Rows 34

T

Text-to-Column

versus Flash Fill 25

U

Unpivot with Power Query 52

V

Venn diagram 16

Straight to the Point

Front cover for Excel Subtotals Straight to the Point by Bill Jelen
Front cover for Excel Dynamic Arrays Straight to the Point by Bill Jelen

Have a topic that you would like to write about for the Straight to the Point series?

We are looking for 50-page books on any specific topic covering Excel, Power Pivot, Power BI or any techie topics. Earn generous advances and revenue sharing. Send your idea to Pub@MrExcel.com.