Probably the most common, and the easier, of our two options for dealing with missing data is to simply remove the observations that have any missing values. By doing so, we are left with only the complete data points, with every value filled in. We can obtain a new DataFrame by invoking the dropna method in pandas, as shown in the following code:
# drop the rows with missing values
pima_dropped = pima.dropna()
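As an aside (this variation is not used in the original walkthrough), dropna also accepts optional parameters that give us finer control over which rows are removed. For example, the subset parameter drops only the rows missing a value in particular columns, and thresh keeps any row with at least a given number of non-missing values:
# drop only the rows where serum_insulin is missing
pima_dropped_insulin = pima.dropna(subset=['serum_insulin'])
# keep rows that have at least eight non-missing values
pima_mostly_complete = pima.dropna(thresh=8)
For the rest of this section, though, we will stick with the simple pima.dropna() call shown above.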
Now, of course, the obvious problem here is that we lost a few rows. To check how many exactly, use the following code:
pct_rows_lost = round(100*(pima.shape[0] - pima_dropped.shape[0])/float(pima.shape[0]))
print "lost {}% of rows".format(pct_rows_lost)
# lost nearly half of the rows!
lost 49.0% of rows
Wow! We lost about 49% of the rows from the original dataset, and if we think about this from a machine learning perspective, even though we now have clean data with everything filled in, we aren't really learning as much as we possibly could be, because we are ignoring nearly half of the data's observations. That's like a doctor trying to understand how heart attacks happen while ignoring nearly half of the patients coming in for check-ups.
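To see which columns are responsible for so many dropped rows, we can count the missing values per column. This quick check is an addition to the original walkthrough and assumes that, as in the earlier cleaning step, the impossible zero values have already been replaced with NaN:
# number of missing values in each column
pima.isnull().sum()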
Let's perform some more EDA on the dataset and compare the statistics about the data from before and after dropping the missing-values rows:
# some EDA of the dataset before it was dropped and after
# split of trues and falses before rows dropped
pima['onset_diabetes'].value_counts(normalize=True)
0 0.651042
1 0.348958
Name: onset_diabetes, dtype: float64
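As a quick aside (not part of the original walkthrough), the proportion of the majority class here, roughly 65%, is our null accuracy: the accuracy we would achieve by always predicting that a person does not develop diabetes. Any model we build should beat this baseline, and we can compute it in a single line:
# null accuracy: accuracy of always predicting the most frequent class
pima['onset_diabetes'].value_counts(normalize=True).max()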
Now, let's look at the same split after we dropped the rows, using the following code:
pima_dropped['onset_diabetes'].value_counts(normalize=True)
0    0.668367
1    0.331633
Name: onset_diabetes, dtype: float64
# the split of trues and falses stay relatively the same
It seems that the binary response stayed relatively the same during the drastic transformation of our dataset. Let's take a look at the shape of our data by comparing the average values of columns before and after the transformation, using the pima.mean function, as follows:
# the mean values of each column (excluding missing values)
pima.mean()
times_pregnant                    3.845052
plasma_glucose_concentration    121.686763
diastolic_blood_pressure         72.405184
triceps_thickness                29.153420
serum_insulin                   155.548223
bmi                              32.457464
pedigree_function                 0.471876
age                              33.240885
onset_diabetes                    0.348958
dtype: float64
And now for the same averages after dropping the rows, using the pima_dropped.mean() function, as follows:
# the mean values of each column (with missing values rows dropped)
pima_dropped.mean()
times_pregnant                    3.301020
plasma_glucose_concentration    122.627551
diastolic_blood_pressure         70.663265
triceps_thickness                29.145408
serum_insulin                   156.056122
bmi                              33.086224
pedigree_function                 0.523046
age                              30.864796
onset_diabetes                    0.331633
dtype: float64
To get a better look at how these numbers changed, let's create a new chart that visualizes the percentages changed on average for each column. First, let's create a table of the percent changes in the average values of each column, as shown in the following code:
# % change in means
(pima_dropped.mean() - pima.mean()) / pima.mean()
times_pregnant                 -0.141489
plasma_glucose_concentration    0.007731
diastolic_blood_pressure       -0.024058
triceps_thickness              -0.000275
serum_insulin                   0.003265
bmi                             0.019372
pedigree_function               0.108439
age                            -0.071481
onset_diabetes                 -0.049650
dtype: float64
And now let's visualize these changes as a bar chart, using the following code:
# % change in means as a bar chart
ax = ((pima_dropped.mean() - pima.mean()) / pima.mean()).plot(kind='bar', title='% change in average column values')
ax.set_ylabel('% change')
The preceding code produces the following output:
[Bar chart: % change in average column values after dropping rows, with % change on the y axis]
We can see that the average of the times_pregnant variable fell by about 14% after dropping the missing-value rows, which is a big change! The average of pedigree_function also rose by about 11%, another big jump. We can see how dropping rows (observations) can severely affect the shape of the data, so we should try to retain as much of it as possible. Before moving on to the next method of dealing with missing values, let's introduce some actual machine learning into the mix.
The following code block (which we will go over line by line in a moment) will become a very familiar one in this book. It fits a single machine learning model over a variety of parameter settings in the hope of obtaining the best possible model, given the features at hand:
# now lets do some machine learning
# note we are using the dataset with the dropped rows
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
X_dropped = pima_dropped.drop('onset_diabetes', axis=1)
# create our feature matrix by removing the response variable
print "learning from {} rows".format(X_dropped.shape[0])
y_dropped = pima_dropped['onset_diabetes']
# our grid search variables and instances
# KNN parameters to try
knn_params = {'n_neighbors':[1, 2, 3, 4, 5, 6, 7]}
knn = KNeighborsClassifier()  # instantiate a KNN model
grid = GridSearchCV(knn, knn_params)
grid.fit(X_dropped, y_dropped)
print grid.best_score_, grid.best_params_
# but we are learning from way fewer rows..
OK, let's go through this line by line. First, we have two new import statements:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
We will be utilizing scikit-learn's K-Nearest Neighbors (KNN) classification model, as well as a grid search module that will automatically find, by brute force, the combination of KNN parameters that gives the best cross-validated accuracy on our data. Next, let's take our dropped dataset (with the missing-value rows removed) and create an X and a y variable for our predictive model. Let's start with our X (our feature matrix):
X_dropped = pima_dropped.drop('onset_diabetes', axis=1)
# create our feature matrix by removing the response variable
print "learning from {} rows".format(X_dropped.shape[0])
learning from 392 rows
Ouch, it's already obvious that there's a major problem with this approach: our machine learning algorithm is going to be fitting and learning from far fewer observations than we started with. Let's now create our y (response series):
y_dropped = pima_dropped['onset_diabetes']
Now that we have our X and our y variables, we can introduce the variables and instances we need to successfully run a grid search. We will limit the grid to seven candidate values of n_neighbors to keep things simple in this chapter. For every data cleaning and feature engineering method we try (dropping rows, filling in data, and so on), we will search for the best KNN using somewhere between one and seven neighbors. We can set this model up as follows:
# our grid search variables and instances
# KNN parameters to try
knn_params = {'n_neighbors':[1, 2, 3, 4, 5, 6, 7]}
Next, we will instantiate a KNN model and a grid search module, as shown in the following code, and fit them to our feature matrix and response variable. Once we do so, we will print out the best accuracy as well as the best parameter used to learn:
knn = KNeighborsClassifier()  # instantiate a KNN model
grid = GridSearchCV(knn, knn_params)
grid.fit(X_dropped, y_dropped)
print grid.best_score_, grid.best_params_
# but we are learning from way fewer rows..
0.744897959184 {'n_neighbors': 7}
So, it seems that, using seven neighbors as its parameter, our KNN model was able to achieve roughly 74.5% cross-validated accuracy (better than the null accuracy, which is about 67% for this reduced dataset), but keep in mind that it is only learning from about 51% of the original data, so who knows how it would have done on the rest of it.
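For some intuition about what the grid search just did, the following sketch (an illustration added here, not code from the original chapter) does roughly the same job by hand: it cross-validates a KNN model for each candidate value of n_neighbors and keeps the best mean accuracy:
# roughly what GridSearchCV automates for us
from sklearn.model_selection import cross_val_score

best_score, best_k = 0.0, None
for k in knn_params['n_neighbors']:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_dropped, y_dropped)
    if scores.mean() > best_score:
        best_score, best_k = scores.mean(), k
# best_score and best_k should closely match grid.best_score_ and grid.best_params_
GridSearchCV performs exactly this kind of exhaustive loop, while also handling the bookkeeping of scores and refitting the best model for us.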
It's probably pretty clear that, while dropping the dirty rows may not exactly be feature engineering, it is still a data cleaning technique we can utilize to help sanitize our machine learning pipeline's inputs. Let's move on to a slightly harder method.