Let's see if we can add a PCA component to the workflow to enhance this accuracy. We will begin again by setting up our variables. This time, we will need to create a scikit-learn Pipeline object to house the PCA module as well as our linear model. We will keep the same parameters that we used for the linear classifier and add new parameters for our PCA, attempting to find the optimal number of components among 10, 100, and 200. Take a moment and hypothesize which of the three will end up being the best (hint: think back to the scree plot and explained variance):
# Use PCA to extract new features
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

lr = LogisticRegression()
pca = PCA()
# set the params for the pipeline
params = {'clf__C': [1e-1, 1e0, 1e1],
          'pca__n_components': [10, 100, 200]}
# create our pipeline
pipeline = Pipeline([('pca', pca), ('clf', lr)])
# instantiate a gridsearch class
grid = GridSearchCV(pipeline, params)
We can now fit the gridsearch object to our raw image data. Note that the pipeline will automatically take care of extracting features from our raw pixel data and transforming them before fitting the classifier:
# fit to our data
grid.fit(images_X, images_y)
# check the best params
grid.best_params_, grid.best_score_
({'clf__C': 1.0, 'pca__n_components': 100}, 0.88949999999999996)
We end up with a (slightly better) 88.95% cross-validated accuracy. If we think about it, we should not be surprised that 100 was the best option out of 10, 100, and 200. From our brief analysis with the scree plot in a previous section, we found that a mere 30 components explained 64% of the variance, so 10 components would definitely not be enough to describe the images well. The scree plot also started to level out at around 100 components, meaning that beyond the 100th component the explained variance was not adding much, so 200 components would have been too many and would have started to lead to some overfitting. That leaves 100 as the optimal number of PCA components to use. It should be noted that we could go further and attempt more hyper-parameter tuning to find an even better number of components, sketched briefly below, but for now we will leave our pipeline as is and move on to using RBM components.
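As a rough, purely illustrative sketch of that finer search, we could pull the fitted PCA step out of the winning pipeline to confirm how much variance its 100 components capture, and then rerun the grid search with candidate values clustered around 100. The best_estimator_ and named_steps attributes are standard scikit-learn API; the candidate values in finer_params are assumptions, not results from the text:
# pull the fitted PCA out of the best pipeline and check its captured variance
import numpy as np
best_pca = grid.best_estimator_.named_steps['pca']
print(np.cumsum(best_pca.explained_variance_ratio_)[-1])
# illustrative finer search in the neighborhood of 100 components
finer_params = {'clf__C': [1e-1, 1e0, 1e1],
                'pca__n_components': [50, 75, 100, 125, 150]}
finer_grid = GridSearchCV(pipeline, finer_params)
finer_grid.fit(images_X, images_y)
finer_grid.best_params_, finer_grid.best_score_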