Probably the most common, and the easier, of our two options for dealing with missing data is to simply remove the observations that have any missing values. By doing so, we are left with only the complete data points, with every value filled in. We can obtain a new DataFrame by invoking the dropna method in pandas, as shown in the following code:
# drop the rows with missing values
pima_dropped = pima.dropna()
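As an aside (this variation is not used in the original walkthrough), dropna also accepts optional parameters that give us finer control over which rows are removed. For example, the subset parameter drops only the rows missing a value in particular columns, and thresh keeps any row with at least a given number of non-missing values:
# drop only the rows where serum_insulin is missing
pima_dropped_insulin = pima.dropna(subset=['serum_insulin'])
# keep rows that have at least eight non-missing values
pima_mostly_complete = pima.dropna(thresh=8)
For the rest of this section, though, we will stick with the simple pima.dropna() call shown above.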
Now, of course, the obvious problem here is that we lost a few rows. To check how many exactly, use the following code:
pct_rows_lost = round(100*(pima.shape[0] - pima_dropped.shape[0])/float(pima.shape[0]))
print "lost {}% of rows".format(pct_rows_lost)
# lost nearly half of the rows!
lost 49.0% of rows
Wow! We lost about 49% of the rows from the original dataset, and if we think about this from a machine learning perspective, even though we now have clean data with everything filled in, we aren't really learning as much as we possibly could be, because we are ignoring nearly half of the data's observations. That's like a doctor trying to understand how heart attacks happen while ignoring nearly half of the patients coming in for check-ups.
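To see which columns are responsible for so many dropped rows, we can count the missing values per column. This quick check is an addition to the original walkthrough and assumes that, as in the earlier cleaning step, the impossible zero values have already been replaced with NaN:
# number of missing values in each column
pima.isnull().sum()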
Let's perform some more EDA on the dataset and compare the statistics about the data from before and after dropping the missing-values rows:
# some EDA of the dataset before it was dropped and after
# split of trues and falses before rows dropped
pima['onset_diabetes'].value_counts(normalize=True)
0 0.651042
1 0.348958
Name: onset_diabetes, dtype: float64
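As a quick aside (not part of the original walkthrough), the proportion of the majority class here, roughly 65%, is our null accuracy: the accuracy we would achieve by always predicting that a person does not develop diabetes. Any model we build should beat this baseline, and we can compute it in a single line:
# null accuracy: accuracy of always predicting the most frequent class
pima['onset_diabetes'].value_counts(normalize=True).max()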
Now, let's look at the same split after we dropped the rows, using the following code:
pima_dropped['onset_diabetes'].value_counts(normalize=True)
0    0.668367
1    0.331633
Name: onset_diabetes, dtype: float64
# the split of trues and falses stay relatively the same
It seems that the binary response stayed relatively the same during the drastic transformation of our dataset. Let's take a look at the shape of our data by comparing the average values of columns before and after the transformation, using the pima.mean function, as follows:
# the mean values of each column (excluding missing values)
pima.mean()
times_pregnant                    3.845052
plasma_glucose_concentration    121.686763
diastolic_blood_pressure         72.405184
triceps_thickness                29.153420
serum_insulin                   155.548223
bmi                              32.457464
pedigree_function                 0.471876
age                              33.240885
onset_diabetes                    0.348958
dtype: float64
And now for the same averages after dropping the rows, using the pima_dropped.mean() function, as follows:
# the mean values of each column (with missing values rows dropped)
pima_dropped.mean()
times_pregnant                    3.301020
plasma_glucose_concentration    122.627551
diastolic_blood_pressure         70.663265
triceps_thickness                29.145408
serum_insulin                   156.056122
bmi                              33.086224
pedigree_function                 0.523046
age                              30.864796
onset_diabetes                    0.331633
dtype: float64
To get a better look at how these numbers changed, let's create a new chart that visualizes the percentages changed on average for each column. First, let's create a table of the percent changes in the average values of each column, as shown in the following code:
# % change in means
(pima_dropped.mean() - pima.mean()) / pima.mean()
times_pregnant                 -0.141489
plasma_glucose_concentration    0.007731
diastolic_blood_pressure       -0.024058
triceps_thickness              -0.000275
serum_insulin                   0.003265
bmi                             0.019372
pedigree_function               0.108439
age                            -0.071481
onset_diabetes                 -0.049650
dtype: float64
And now let's visualize these changes as a bar chart, using the following code:
# % change in means as a bar chart
ax = ((pima_dropped.mean() - pima.mean()) / pima.mean()).plot(kind='bar', title='% change in average column values')
ax.set_ylabel('% change')
The preceding code produces the following output:
[Bar chart: % change in average column values after dropping rows, with % change on the y axis]
We can see that the average of the times_pregnant variable fell by about 14% after dropping the missing-value rows, which is a big change! The average of pedigree_function also rose by about 11%, another big jump. We can see how dropping rows (observations) can severely affect the shape of the data, so we should try to retain as much of it as possible. Before moving on to the next method of dealing with missing values, let's introduce some actual machine learning into the mix.
The following code block (which we will go over line by line in a moment) will become a very familiar one in this book. It fits a single machine learning model over a variety of parameter settings in the hope of obtaining the best possible model, given the features at hand:
# now lets do some machine learning
# note we are using the dataset with the dropped rows
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
X_dropped = pima_dropped.drop('onset_diabetes', axis=1)
# create our feature matrix by removing the response variable
print "learning from {} rows".format(X_dropped.shape[0])
y_dropped = pima_dropped['onset_diabetes']
# our grid search variables and instances
# KNN parameters to try
knn_params = {'n_neighbors':[1, 2, 3, 4, 5, 6, 7]}
knn = KNeighborsClassifier()  # instantiate a KNN model
grid = GridSearchCV(knn, knn_params)
grid.fit(X_dropped, y_dropped)
print grid.best_score_, grid.best_params_
# but we are learning from way fewer rows..
OK, let's go through this line by line. First, we have two new import statements:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
We will be utilizing scikit-learn's K-Nearest Neighbors (KNN) classification model, as well as a grid search module that will automatically find, by brute force, the combination of KNN parameters that gives the best cross-validated accuracy on our data. Next, let's take our dropped dataset (with the missing-value rows removed) and create an X and a y variable for our predictive model. Let's start with our X (our feature matrix):
X_dropped = pima_dropped.drop('onset_diabetes', axis=1)
# create our feature matrix by removing the response variable
print "learning from {} rows".format(X_dropped.shape[0])
learning from 392 rows
Ouch, it's already obvious that there's a major problem with this approach: our machine learning algorithm is going to be fitting and learning from far fewer observations than we started with. Let's now create our y (response series):
y_dropped = pima_dropped['onset_diabetes']
Now that we have our X and our y variables, we can introduce the variables and instances we need to successfully run a grid search. We will limit the grid to seven candidate values of n_neighbors to keep things simple in this chapter. For every data cleaning and feature engineering method we try (dropping rows, filling in data, and so on), we will search for the best KNN using somewhere between one and seven neighbors. We can set this model up as follows:
# our grid search variables and instances
# KNN parameters to try
knn_params = {'n_neighbors':[1, 2, 3, 4, 5, 6, 7]}
Next, we will instantiate a KNN model and a grid search module, as shown in the following code, and fit them to our feature matrix and response variable. Once we do so, we will print out the best accuracy as well as the best parameter used to learn:
knn = KNeighborsClassifier()  # instantiate a KNN model
grid = GridSearchCV(knn, knn_params)
grid.fit(X_dropped, y_dropped)
print grid.best_score_, grid.best_params_
# but we are learning from way fewer rows..
0.744897959184 {'n_neighbors': 7}
So, it seems that, using seven neighbors as its parameter, our KNN model was able to achieve roughly 74.5% cross-validated accuracy (better than the null accuracy, which is about 67% for this reduced dataset), but keep in mind that it is only learning from about 51% of the original data, so who knows how it would have done on the rest of it.
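For some intuition about what the grid search just did, the following sketch (an illustration added here, not code from the original chapter) does roughly the same job by hand: it cross-validates a KNN model for each candidate value of n_neighbors and keeps the best mean accuracy:
# roughly what GridSearchCV automates for us
from sklearn.model_selection import cross_val_score

best_score, best_k = 0.0, None
for k in knn_params['n_neighbors']:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_dropped, y_dropped)
    if scores.mean() > best_score:
        best_score, best_k = scores.mean(), k
# best_score and best_k should closely match grid.best_score_ and grid.best_params_
GridSearchCV performs exactly this kind of exhaustive loop, while also handling the bookkeeping of scores and refitting the best model for us.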
It's probably pretty clear that, while dropping the dirty rows may not exactly be feature engineering, it is still a data cleaning technique we can utilize to help sanitize our machine learning pipeline's inputs. Let's move on to a slightly harder method.