
Data Cleaning and Munging


Data cleaning means getting rid of any information that doesn't belong in the dataset and correcting any mistakes. Your approach is going to depend on the data you have gathered and on your end goal.

Here is a general guideline on how to clean data.

Does what you’re reading make sense?

First, you have to look over your data. If your dataset is fairly large, look at the top 20 rows, the bottom 20 rows, and then another random 20-row sample. The following are some questions that you need to ask yourself about the data.

If you have enough information, you need to correct the values.

Once you have noticed all the problems within your data, start to think about ways to fix them. Some corrections will be obvious, such as a name being written like J4osh. You know it should be Josh.
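For mechanical errors like this, a pattern-based fix can handle every occurrence at once. Here is a minimal sketch, assuming the names live in a pandas Series (the example values are hypothetical):

import pandas as pd

names = pd.Series(['J4osh', 'Maria', 'An9na'])  # hypothetical garbled names

# Remove stray digits that have crept into the names
names = names.str.replace(r'\d', '', regex=True)
print(names.tolist())  # ['Josh', 'Maria', 'Anna']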

However, there could be corrections that aren't as obvious. Let's say that your data covers more than 500,000 people and somebody has the name Brick. You may think the name is a typo and that it should say Rick. You could make that change if you have information confirming it is, in fact, an error, such as knowing the person the record belongs to. But unless you do have that much confidence, you probably shouldn't change that type of value, or you could end up introducing an error yourself.

Try to figure out whether the data issues follow any logic, and make the corrections needed. If your system only allows phone numbers made up of digits with no special characters, but your data has phone numbers written like (333)123-4567, you can use Excel or Python to find and remove the dashes and parentheses, as in the sketch below.
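Here is one way to do that in Python, assuming the phone numbers sit in a pandas Series (the values are made up for illustration):

import pandas as pd

phones = pd.Series(['(333)123-4567', '(444) 555-0123'])

# Strip every character that is not a digit
phones = phones.str.replace(r'\D', '', regex=True)
print(phones.tolist())  # ['3331234567', '4445550123']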

If you are able to, you can use this logic to fill in data for the null values. However, if you can't find any logic, there are different ways to address the missing data. One of the common ways is to use the median or average of the column to fill in the blank spots, which is quite easy but can affect your data's distribution. There are also more complicated, but statistically solid, methods to fill in your blank spots, like MICE (multiple imputation by chained equations). You may even want to create a column of metadata that indicates which values are imputed and which are organic, as sketched below. No matter which method you choose, if you have a huge number of nulls in a column, it may be in your best interest to forgo using it in your analysis.
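A minimal sketch of the simple approach, including the metadata column; the Income column here is hypothetical:

import numpy as np
import pandas as pd

# Hypothetical column with a couple of blanks
df = pd.DataFrame({'Income': [52000, np.nan, 61000, np.nan, 48000]})

# Metadata column: True where the value will be imputed rather than organic
df['Income_imputed'] = df['Income'].isnull()

# Simple fill with the column mean (easy, but it can distort the distribution)
df['Income'] = df['Income'].fillna(df['Income'].mean())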

Make use of the tools that make sense.

When you think about which tool you need to reach your goal, remember that the right tool can save you from doing the same work over and over again. Each tool has its own pros and cons, so here is a bit of a rule of thumb.

You can use Excel when the dataset is small and the cleanup is a one-time job.

You could use Python or other scripting languages when the dataset is large or the same cleaning will have to be repeated.

The tools that you are more comfortable with will also play a role in this.

Start communicating with your source.

If you discover that you can't read something in your data, don't be afraid to call or email somebody. For example, if you were given data covering every county in a state, and you had to have the county names, but all the data gives you are number codes, you should call the person who sent you the data. They can shed some light on how they organized their information, as in the sketch below. This is even more true if you are a client of the data source, since they have every reason to give you clear answers. Communicating with them will save you a lot of time and heartache.
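Once the source tells you how the codes map to names, joining them on is straightforward. A sketch with hypothetical codes and county names:

import pandas as pd

# Hypothetical lookup table the source might send over after you ask
lookup = pd.DataFrame({'county_code': [1, 2, 3],
                       'county_name': ['Adams', 'Baker', 'Clark']})

data = pd.DataFrame({'county_code': [2, 1, 3],
                     'population': [1200, 800, 950]})

# Attach the readable names to the coded records
data = data.merge(lookup, on='county_code', how='left')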

This is one of the main, and most basic, approaches to cleaning your data, but it isn't the only one you could take. The more experience you gain, the more creative you can get with your cleaning.

Data munging, sometimes known as data wrangling, is the process of manually changing or mapping data from one raw form into a different form that is more convenient to work with, often with the help of semi-automated tools.

This could mean modifying the values in a column in a particular way or grouping several columns together, as in the sketch below. The need for data munging typically comes from poorly presented or collected data. Data that has been entered manually normally has a lot of errors, and data collected through websites is normally optimized to be shown on those websites, not to be aggregated and sorted.
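For example, grouping two columns into one more convenient column might look like this (a sketch with hypothetical name columns):

import pandas as pd

# Hypothetical split-out name columns
df = pd.DataFrame({'last_name': ['Smith', 'Jones'],
                   'first_name': ['Ann', 'Bob']})

# Combine the two columns into a single, more convenient column
df['full_name'] = df['last_name'] + ', ' + df['first_name']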

The next steps we're going to go through will look at how to munge data. It is important that you don't try to follow along with these steps until you have done the Python setup that we will go through in the next chapter. That setup will get you to where you need to be to start these steps.

1. Look for missing values in your dataset.

You can continue using your own dataset for this, but for the purposes of this tutorial I will be using a sample dataset so that I can be more direct with my explanation. We're going to start by looking at Cabin. When you first look at this variable, it will leave you with the impression that there may be too many nulls in the dataset. Let's check how many nulls there are.

sum(df['Cabin'].isnull())

This command will tell you the number of missing, or null, values. You should get a return of 687, which means there are a lot of missing values and you should drop this variable.
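If you prefer to judge by the share of missing values rather than the raw count, the same check can be expressed as a fraction of the column:

df['Cabin'].isnull().mean()  # fraction of rows with a missing Cabin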

Now we need to look at Ticket. The variable Ticket looks like a mixture of text and numbers, and it doesn't seem to carry any useful information. We should drop Ticket too.

df = df.drop(['Ticket', 'Cabin'], axis=1)

2. Filling in the missing age values.

There are a lot of different ways to fill in the missing age values. The simplest is to replace them with the mean, which can be done like this:

meanAge = np.mean(df.Age)
df.Age = df.Age.fillna(meanAge)

The other thing you could do is create a supervised learning model that predicts ages based on all the other variables, and then use age along with the other variables to predict survival.

Since we are trying to learn data munging, it would be best to take an approach that is somewhere between these two extremes. The main hypothesis would be that the information in pclass, name, and gender, combined, can give us the information we need to figure out the missing age values.

The following are the steps that you need to take to work this hypothesis:

Step 1: Extract information from Name

We are going to create a function that extracts the information in Name, which is written like this:

Family_Name, Salutation. First Name

def name_extract(word):
    return word.split(',')[1].split('.')[0].strip()
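As a quick sanity check, you can call the function on a single name first:

name_extract('Jain, Mr. Kunal')     # returns 'Mr'
name_extract('Jain, Miss. Jenika')  # returns 'Miss'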

With this code, "Jain, Mr. Kunal" would become Mr and "Jain, Miss. Jenika" would become Miss. Then we can apply this function to change the whole column.

df2 = pd.DataFrame({'Salutation': df['Name'].apply(name_extract)})

After we have the salutations, we can look at how they are distributed. We will now use groupby after we merge DataFrame df2 with DataFrame df:

df = pd.merge(df, df2, left_index=True, right_index=True)  # merges on index
temp1 = df.groupby('Salutation').PassengerId.count()
print(temp1)

The output you should get is:

Capt            1
Col             2
Don             1
Dr              7
Jonkheer        1
Lady            1
Major           2
Master         40
Miss          182
Mlle            2
Mme             1
Mr            517
Mrs           125
Ms              1
Rev             6
Sir             1
The Countess    1
dtype: int64

From this information, you can tell that there are four main salutations: Master, Mr, Miss, and Mrs. The other salutations are much less common, so we will group all of them into one salutation named Others. To do this, we can use an approach similar to the one we used to get the salutations.

def group_salutation(old_salutation):
    if old_salutation == 'Mr':
        return 'Mr'
    elif old_salutation == 'Mrs':
        return 'Mrs'
    elif old_salutation == 'Master':
        return 'Master'
    elif old_salutation == 'Miss':
        return 'Miss'
    else:
        return 'Others'

df3 = pd.DataFrame({'New_Salutation': df['Salutation'].apply(group_salutation)})
df = pd.merge(df, df3, left_index=True, right_index=True)
temp1 = df3.groupby('New_Salutation').size()
print(temp1)
df.boxplot(column='Age', by='New_Salutation')

You should then see the counts for the new salutations, which would look like this:

Master     40
Miss      182
Mr        517
Mrs       125
Others     27

Step 2: Create a simple grid

We now want to make a pivot table that shows the median age for each combination of new salutation, class, and gender. We will define a function that gives us the value of our chosen cell and use it to fill in all of the missing ages.

table = df.pivot_table(values='Age', index=['New_Salutation'],
                       columns=['Pclass', 'Sex'], aggfunc=np.median)

# Define a function to look up a value in this pivot table
def fage(x):
    return table[x['Pclass']][x['Sex']][x['New_Salutation']]

# Replace missing values
df['Age'].fillna(df[df['Age'].isnull()].apply(fage, axis=1), inplace=True)

This gives you a pretty good way to fill in the missing age values.
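As a quick check that the fill worked, you can rerun the null count on Age:

sum(df['Age'].isnull())  # should now return 0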

How should you treat the outliers in the distribution of fare?

As we saw, the mean fares match up fairly well with Pclass. However, there are some extreme outliers. One data point probably grabs your attention: the fare of 512 for class one. Chances are this is an error. There are several ways to change the data: replace it with the mean or median fare for class one, or change the value to the second-highest fare in the data, which relates more closely to the other data points.

You can decide which one to choose and replace the respective values, as in the sketch below. The commands work similarly to the ones we've already gone through.
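A sketch of both options, assuming the suspicious 512 fare is the only value above 500:

# Option 1: replace the suspicious fare with the median fare for class one
median_fare_c1 = df.loc[df['Pclass'] == 1, 'Fare'].median()
df.loc[df['Fare'] > 500, 'Fare'] = median_fare_c1

# Option 2: cap it at the second-highest distinct fare instead
# second_highest = df['Fare'].drop_duplicates().nlargest(2).iloc[-1]
# df.loc[df['Fare'] > 500, 'Fare'] = second_highest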

You now have a set of data that you can use to build a predictive model.